rtferraz committed · verified
Commit b110818 · 1 Parent(s): 734569e

Delete grpo_vertex_v3.md

We're going to use the .ipynb version only.

Files changed (1):
  1. grpo_vertex_v3.md +0 -1322

grpo_vertex_v3.md DELETED
@@ -1,1322 +0,0 @@
# Tucano2 Commerce — GRPO Training v3 (Vertex AI Workbench / L4)

**v3 changes over v2 — grounded in published research:**

| Change | v2 Value | v3 Value | Paper Reference |
|--------|----------|----------|-----------------|
| Temperature | 0.8 | **1.0** | Skywork-OR1 (2505.22312) §4: τ=1.0 gives 5-8% better results, delays entropy collapse |
| Completion length | 2048 | **4096** | Dr. GRPO (2503.20783) §3.1: length bias inflates wrong answers → ceiling hit blocks learning |
| Num generations | 8 | **4** | VRAM tradeoff: 4×4096 ≈ 8×2048. MC-GRPO (2601.22582): G=4 works with noise mitigation |
| Learning rate | 5e-7 | **2e-6** | Dr. GRPO Appendix G: LR=1e-6; Reasoning-SQL: LR=1e-6. v2 clip_ratio=0 → room to push 2-4× |
| β (KL penalty) | implicit | **0.0** | Dr. GRPO §3.2: β=0 optimal for rule-based rewards |
| Training data | 300 | **ALL (~1400)** | Skywork-OR1 §3.1: small prompt sets → model memorizes → entropy collapse |
| Reward functions | single composite | **staged (format→partial→task)** | Reasoning-SQL (2503.23157) §3.2: format rewards converge first, enable task learning |
| Zero-advantage groups | included | **filtered with noise injection** | Skywork-OR1 §3.1: zero-std groups destabilize training |
| Entropy monitoring | none | **EntropyMonitorCallback** | Skywork-OR1 §4: early detection prevents collapse |
| Early stopping patience | 10 | **15** | More runway for longer completions |
| Save total limit | 3 | **5** | Keep more checkpoints — v2 lost the best one |
| Eval temperature | 0.7 | **0.1** | Deterministic eval = less noisy signal |
| General reasoning mix | none | **30% (optional)** | Cocktail Effect (2410.01109): multi-task mix boosts domain performance 2-15% |
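
Several rows above cite Dr. GRPO's bias analysis. In Cell 3 this lands as `SCALE_REWARDS = False`, which drops the per-group std normalization from the advantage computation. A minimal sketch with toy reward values (hypothetical numbers, for illustration only):

```python
# Toy illustration (hypothetical reward values) of the Dr. GRPO advantage fix.
import statistics

group = [0.2, 0.5, 0.5, 0.8]  # rewards for the G=4 completions of one prompt
mu = statistics.mean(group)
sigma = statistics.pstdev(group)

adv_vanilla = [(r - mu) / (sigma + 1e-4) for r in group]  # vanilla GRPO: std-normalized
adv_dr_grpo = [r - mu for r in group]                     # scale_rewards=False: mean-centered only

# Low-variance groups get their advantages inflated by the std division;
# dropping it removes that per-prompt difficulty bias.
print(adv_vanilla, adv_dr_grpo)
```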

**Prerequisites:**

- Upload `data/pairs/train.jsonl` (2.1 MB) to `./data/pairs/`
- Upload `models/tucano2-commerce-sft/` (126 MB) to `./models/tucano2-commerce-sft/`
- **NEW:** Optional `data/pairs/general_reasoning.jsonl` for the 30% general data mix

**Hardware:** L4 (24GB), PyTorch kernel, bf16 supported
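
A quick pre-flight check that these uploads are in place (a minimal sketch; paths assume the Cell 3 config below):

```python
# Pre-flight: verify prerequisite files before burning GPU time.
# Paths assume the Cell 3 config below; adjust if your layout differs.
from pathlib import Path

BASE = Path("/home/jupyter/tucano2")
for p in [BASE / "data" / "pairs" / "train.jsonl",
          BASE / "models" / "tucano2-commerce-sft",
          BASE / "data" / "pairs" / "general_reasoning.jsonl"]:  # optional mix file
    print(f"{'✓' if p.exists() else '✗ MISSING'} {p}")
```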

---

## Cell 1: Dependencies

Restart your kernel first (Kernel → Restart), then run these cells in order, one at a time:

```python
# Cell 1a — Nuke everything ML-related
!pip uninstall -y torch torchvision torchaudio \
    unsloth unsloth-zoo \
    trl transformers peft accelerate \
    bitsandbytes vllm vllm-flash-attn \
    datasets tokenizers safetensors huggingface-hub \
    wandb xformers triton \
    cuda-bindings cuda-python \
    sentencepiece protobuf \
    2>/dev/null
```

```python
# Cell 1b — Kill any stragglers
!pip freeze | grep -iE "torch|unsloth|trl|vllm|bitsandbytes|transformers|peft|accelerate" | xargs pip uninstall -y 2>/dev/null
```

```python
# Cell 1c — Purge cache
!pip cache purge
```

**⚠️ Restart kernel again**, then:

```python
# Cell 1d — Clean install, correct order
!pip install "unsloth"
```

```python
# Cell 1e — Pin TRL (Unsloth may pull a different version)
!pip install "trl==0.24.0" --no-deps
```

```python
# Cell 1f — Extra deps
!pip install "rich" "wandb"
```

---

## Cell 2: Hello World — GPU + Unsloth Verification

```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
print(f"bf16 support: {torch.cuda.is_bf16_supported()}")

from unsloth import FastLanguageModel
print("\n✓ Unsloth loaded successfully")

import trl
print(f"✓ TRL version: {trl.__version__}")

import transformers
print(f"✓ Transformers version: {transformers.__version__}")
```

---

## Cell 3: Config + Constants

```python
import os
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"

import json
import re
import time
import random
import gc
from pathlib import Path

# ══════════════════════════════════════════════════════════════════════════════
# v3 CONFIG — Every change is annotated with paper reference
# ══════════════════════════════════════════════════════════════════════════════

MODEL_ID = "Polygl0t/Tucano2-qwen-3.7B-Think"
MAX_SEQ_LENGTH = 8192  # v3: increased from 4096 — model supports 32k, we need room for 4096 completion + prompt

# ── Paths ─────────────────────────────────────────────────────────────────────
DATA_DIR = Path("/home/jupyter/tucano2/data")
MODELS_DIR = Path("/home/jupyter/tucano2/models")
SFT_ADAPTER_DIR = MODELS_DIR / "tucano2-commerce-sft"
GRPO_ADAPTER_DIR = MODELS_DIR / "tucano2-commerce-grpo-v3"  # v3: separate dir from v2
CHECKPOINT_DIR = GRPO_ADAPTER_DIR / "checkpoints"

# ── Training data ─────────────────────────────────────────────────────────────
GRPO_PROMPTS = None      # v3: None = use ALL available prompts (was 300 subset in v2)
GENERAL_MIX_RATIO = 0.0  # v3: set to 0.3 if general_reasoning.jsonl exists (Cocktail Effect paper)

# ── Valid enums for reward scoring (unchanged from v2) ────────────────────────
VALID_SENTIMENTS = {"positive", "negative", "neutral"}
VALID_CATEGORIES = {
    "delivery_delay", "product_quality", "product_not_received",
    "wrong_product", "seller_communication", "app_issue",
    "price_value", "other", "none",
}
VALID_CHURN = {"low", "medium", "high"}
VALID_REPEAT = {"yes", "no", "maybe"}
EXTRACTION_FIELDS = [
    "sentiment", "sentiment_score", "churn_risk", "delivery_issue",
    "product_issue", "seller_issue", "main_complaint",
    "complaint_category", "repeat_intent", "would_recommend",
]

SYSTEM_PT = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro."
)

# ══════════════════════════════════════════════════════════════════════════════
# TRAINING HYPERPARAMETERS — v3 fixes (all changes annotated)
# ══════════════════════════════════════════════════════════════════════════════

# ── Core GRPO params ──────────────────────────────────────────────────────────
BATCH_SIZE = 4
GRAD_ACCUM = 1  # v3: reduced from 2. Effective batch = 4×1 = 4 (was 8)
                # With G=4: steps = prompts × 4 / 4 = prompts per epoch
NUM_GENERATIONS = 4  # v3: reduced from 8 — VRAM tradeoff for longer completions
                     # MC-GRPO (2601.22582): G=4 works if noise is mitigated
SCALE_REWARDS = False  # Dr. GRPO (2503.20783): remove std normalization bias

# ── v3 CRITICAL FIXES ────────────────────────────────────────────────────────

# FIX 1: Temperature — prevent entropy collapse
# v2 had 0.8. All published GRPO papers use 1.0.
# Skywork-OR1 (2505.22312) ablation: τ=1.0 vs τ=0.6 → 5-8% better test performance
TEMPERATURE = 1.0

# FIX 2: Completion length — remove the ceiling
# v2: every single completion hit the 2048 ceiling. Model couldn't finish reasoning.
# Dr. GRPO (2503.20783) §3.1: GRPO length bias inflates wrong answers → the ceiling kills the gradient
MAX_COMPLETION_LENGTH = 4096

# FIX 3: Learning rate — more aggressive
# v2: clip_ratio=0 on all steps → updates were too small to matter
# Dr. GRPO Appendix G: LR=1e-6 (constant). Reasoning-SQL: LR=1e-6 with cosine.
# We go 2× since v2 showed zero clipping (model can absorb stronger push)
LEARNING_RATE = 2e-6

# FIX 4: β = 0 (no KL penalty)
# Dr. GRPO (2503.20783) §3.2: KL penalty is unnecessary for rule-based rewards
# v2 used implicit KL through default β — we explicitly disable it
BETA = 0.0

# ── Training schedule ─────────────────────────────────────────────────────────
NUM_EPOCHS = 1
MAX_STEPS = 500  # v3: increased for expanded data; early stopping will halt if needed
                 # With ~1400 prompts × 4 gen / (4 batch × 1 accum) = 1400 steps/epoch
                 # MAX_STEPS=500 < 1 epoch — early stopping or manual extension

# ── Checkpoint + Eval + Early-Stop ────────────────────────────────────────────
EVAL_SPLIT_RATIO = 0.15
EVAL_STEPS = 10
EARLY_STOPPING_PATIENCE = 15  # v3: increased from 10 — gives 150 steps of runway
EARLY_STOPPING_DELTA = 0.005  # v3: reduced from 0.01 — more sensitive to small gains
SAVE_STEPS = 10               # v3: more frequent (was 15) — never lose best checkpoint again
SAVE_TOTAL_LIMIT = 5          # v3: keep more checkpoints (was 3 — lost best in v2)
WANDB_PROJECT = "tucano2-commerce"

# ── Eval callback ─────────────────────────────────────────────────────────────
EVAL_MAX_SAMPLES = 5
EVAL_MAX_TOKENS = 4096  # v3: match training max_completion_length (was 2048)
EVAL_TEMPERATURE = 0.1  # v3: deterministic eval for less noisy signal (was 0.7)

# ── Backend ───────────────────────────────────────────────────────────────────
USE_VLLM = False

# ── v3: Zero-advantage noise injection ────────────────────────────────────────
# Skywork-OR1 (2505.22312) §3.1: zero-std groups destabilize GRPO training
# When all G completions get identical rewards, the advantage is undefined.
# Noise injection breaks ties without corrupting the signal.
ZERO_ADV_NOISE_STD = 0.005  # Small gaussian noise added to zero-variance groups

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# ── Version assertion ─────────────────────────────────────────────────────────
import trl as _trl
assert _trl.__version__ == "0.24.0", (
    f"UnslothGRPOTrainer was written for TRL 0.24.0, found {_trl.__version__}.\n"
    "Verify that GRPOTrainer._generate() still exists before proceeding."
)

print("✓ v3 Config loaded")
print(f"  SFT adapter: {SFT_ADAPTER_DIR} (exists: {SFT_ADAPTER_DIR.exists()})")
print(f"  Train data: {DATA_DIR / 'pairs' / 'train.jsonl'} (exists: {(DATA_DIR / 'pairs' / 'train.jsonl').exists()})")
print(f"  Training: batch={BATCH_SIZE}, grad_accum={GRAD_ACCUM}, eff_batch={BATCH_SIZE*GRAD_ACCUM}")
print(f"  GRPO: G={NUM_GENERATIONS}, temp={TEMPERATURE}, LR={LEARNING_RATE}, β={BETA}")
print(f"  Completion: max={MAX_COMPLETION_LENGTH} (v2 was 2048)")
print(f"  ADR: save_steps={SAVE_STEPS}, eval_steps={EVAL_STEPS}, patience={EARLY_STOPPING_PATIENCE}")
print(f"✓ TRL {_trl.__version__} verified")

# ══════════════════════════════════════════════════════════════════════════════
# v3 VRAM BUDGET (L4 24GB)
# ══════════════════════════════════════════════════════════════════════════════
# Model (NF4):           ~3.5 GB
# KV Cache (8192 seq):   ~3.0 GB
# Activations:           ~4.0 GB
# Optimizer states:      ~3.0 GB
# Generations (4×4096):  ~8.0 GB
# ─────────────────────────────────
# Estimated total:      ~21.5 GB
# Headroom:              ~2.5 GB
#
# If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first, then 2560.
# Do NOT reduce NUM_GENERATIONS below 4 — GRPO needs variance.
# ══════════════════════════════════════════════════════════════════════════════
```
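
To compare the static budget above with the live card, a quick snapshot (assumes `torch` imported in Cell 2 is still in the kernel):

```python
# Live VRAM snapshot to sanity-check the static budget above.
free_b, total_b = torch.cuda.mem_get_info()  # bytes
print(f"Free: {free_b/1e9:.1f} GB / {total_b/1e9:.1f} GB total")
```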

---

## Cell 4: Load SFT Adapter

```python
print("Loading SFT adapter...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=str(SFT_ADAPTER_DIR),
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=True,
    dtype=None,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load chat template from base model (SFT adapter doesn't save it)
from transformers import AutoTokenizer
base_tok = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.chat_template = base_tok.chat_template
del base_tok

# v2: Force KV cache — Unsloth patching may reset this
model.config.use_cache = True
model.generation_config.use_cache = True

print(f"✓ Model loaded on {model.device}")
print(f"  use_cache: {model.config.use_cache}")
print(f"  Params: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")
print(f"  Chat template: {tokenizer.chat_template[:50]}...")
```
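
Optionally, verify the LoRA adapter actually attached (a sketch that just scans module names, since exact PEFT attribute names vary across versions):

```python
# Optional: verify LoRA modules are present on the loaded model.
n_lora = sum(1 for name, _ in model.named_modules() if "lora" in name.lower())
print(f"LoRA modules found: {n_lora}")
if n_lora == 0:
    print("⚠️ No LoRA modules — check that SFT_ADAPTER_DIR points at the adapter, not the base model")
```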

---

## Cell 5: Single Inference Test

**Gate:** Does the model close `</think>` and produce an answer within 4096 tokens?

```python
FastLanguageModel.for_inference(model)

test_msgs = [
    {"role": "system", "content": SYSTEM_PT},
    {"role": "user", "content": "Quais são as categorias de reclamação mais frequentes e como afetam a nota média?"},
]
text = tokenizer.apply_chat_template(test_msgs, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

t0 = time.time()
outputs = model.generate(**inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True)
elapsed = time.time() - t0

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
gen_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]

print(f"Generation time: {elapsed:.1f}s ({gen_tokens} tokens, {gen_tokens/elapsed:.1f} tok/s)")
print(f"Response length: {len(response)} chars, {gen_tokens} tokens")
print(f"Hit ceiling: {gen_tokens >= MAX_COMPLETION_LENGTH}")  # v3: should NOT hit ceiling with 4096
print(f"closed_think: {'</think>' in response}")
print(f"\n{'='*60}")
print(response[:800])
```

---

## Cell 5b: KV Cache Diagnostic

```python
import time
FastLanguageModel.for_inference(model)

_kv_msgs = [{"role": "system", "content": SYSTEM_PT},
            {"role": "user", "content": "Qual a categoria de reclamação mais frequente?"}]
_kv_text = tokenizer.apply_chat_template(_kv_msgs, tokenize=False, add_generation_prompt=True)
_kv_inputs = tokenizer(_kv_text, return_tensors="pt").to(model.device)

_token_times, _past, _generated = [], None, _kv_inputs["input_ids"]
with torch.no_grad():
    for _step in range(50):
        _t0 = time.time()
        seq_len = _generated.shape[1]
        if _past is None:
            _position_ids = torch.arange(seq_len, dtype=torch.long, device=model.device).unsqueeze(0)
        else:
            _position_ids = torch.tensor([[seq_len - 1]], dtype=torch.long, device=model.device)
        _out = model(
            input_ids=_generated[:, -1:] if _past is not None else _generated,
            position_ids=_position_ids,
            attention_mask=torch.ones(1, seq_len, device=model.device),
            past_key_values=_past,
            use_cache=True,
            return_dict=True,
        )
        _past = _out.past_key_values
        _next = _out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        _generated = torch.cat([_generated, _next], dim=1)
        _token_times.append(time.time() - _t0)

_ratio = sum(_token_times[45:]) / max(sum(_token_times[:5]), 1e-9)
print(f"First 5 tok : {[f'{t*1000:.0f}ms' for t in _token_times[:5]]}")
print(f"Last 5 tok  : {[f'{t*1000:.0f}ms' for t in _token_times[45:]]}")
print(f"Ratio last/first: {_ratio:.1f}x")
if _ratio < 3:
    print("✓ KV cache is working correctly")
elif _ratio < 6:
    print("⚠ KV cache may be degraded — check model.config.use_cache")
else:
    print("✗ KV cache BROKEN — GRPO generation will be catastrophically slow.")

del _past, _generated, _kv_inputs, _token_times, _out
gc.collect()
if torch.cuda.is_available(): torch.cuda.empty_cache()
```

---

## Cell 6: Reward Functions v3

**v3 changes:**

- Staged reward design: format → partial content → full task (Reasoning-SQL, 2503.23157)
- Zero-advantage noise injection (Skywork-OR1, 2505.22312)
- Extraction reward redesigned for completion-length-friendly scoring

```python
def strip_think(text: str) -> str:
    """Remove <think>...</think> block, return the answer portion."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def has_think_block(text: str) -> bool:
    """Check if text contains a non-empty <think> block."""
    return bool(re.search(r"<think>.+</think>", text, flags=re.DOTALL))


def _classify_task_type(prompt_text: str) -> str:
    """Classify prompt into task type by keywords."""
    p = prompt_text.lower()
    if "retorne um objeto json" in p or "extraia dados" in p:
        return "extraction"
    elif "notificação push" in p or "notificação de reengajamento" in p:
        return "push"
    elif "perfil do cliente" in p:
        return "insights"
    else:
        return "sql_qa"


def _json_similarity(text: str) -> float:
    """Rough heuristic: how JSON-like is this text? 0.0 to 1.0."""
    text = text.strip()
    if not text:
        return 0.0
    score = 0.0
    if text.startswith("{") and text.endswith("}"):
        score += 0.5
    if '"' in text:
        score += 0.2
    if ":" in text:
        score += 0.2
    if "," in text:
        score += 0.1
    return min(score, 1.0)


def _string_similarity(a: str, b: str) -> float:
    """Simple Jaccard-like similarity for short strings. 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    a_set = set(a.split())
    b_set = set(b.split())
    intersection = len(a_set & b_set)
    union = len(a_set | b_set)
    return intersection / union if union > 0 else 0.0


# ══════════════════════════════════════════════════════════════════════════════
# v3 STAGED REWARD DESIGN
# Reference: Reasoning-SQL (2503.23157) §3.2
#
# Each reward function scores THREE stages independently:
#   Stage 1 — FORMAT  (0.0–0.2): Is the output well-structured?
#   Stage 2 — PARTIAL (0.0–0.3): Are some content elements correct?
#   Stage 3 — TASK    (0.0–0.5): Is the full task completed correctly?
#
# Format rewards converge first (easy to learn), which stabilizes training
# and enables the model to then learn harder task-specific skills.
# ══════════════════════════════════════════════════════════════════════════════


def reward_extraction(completion: str) -> float:
    """Staged reward for structured extraction (max 1.0)."""
    answer = strip_think(completion)

    # ── Stage 1: FORMAT (max 0.2) ─────────────────────────────────────────────
    r_format = 0.0
    if has_think_block(completion):
        r_format += 0.1  # Used reasoning

    try:
        data = json.loads(answer)
        if isinstance(data, dict):
            r_format += 0.1  # Valid JSON object
    except (json.JSONDecodeError, TypeError):
        r_format += 0.05 * _json_similarity(answer)
        return min(r_format, 0.2)

    if not isinstance(data, dict):
        return min(r_format, 0.2)

    # ── Stage 2: PARTIAL CONTENT (max 0.3) ────────────────────────────────────
    r_partial = 0.0

    present = sum(1 for f in EXTRACTION_FIELDS if f in data)
    r_partial += 0.15 * (present / len(EXTRACTION_FIELDS))

    type_checks = 0
    type_total = 0
    for field in EXTRACTION_FIELDS:
        if field not in data:
            continue
        type_total += 1
        val = data[field]
        if field in ("delivery_issue", "product_issue", "seller_issue", "would_recommend"):
            if isinstance(val, bool):
                type_checks += 1
        elif field in ("sentiment_score",):
            if isinstance(val, (int, float)):
                type_checks += 1
        elif field in ("main_complaint", "sentiment", "complaint_category", "churn_risk", "repeat_intent"):
            if isinstance(val, str):
                type_checks += 1
    if type_total > 0:
        r_partial += 0.15 * (type_checks / type_total)

    # ── Stage 3: FULL TASK (max 0.5) ─────────────────────────────────────────
    r_task = 0.0
    cat_checks = 0
    cat_total = 0

    checks = [
        ("sentiment", lambda v: v in VALID_SENTIMENTS),
        ("complaint_category", lambda v: v in VALID_CATEGORIES),
        ("churn_risk", lambda v: v in VALID_CHURN),
        ("repeat_intent", lambda v: v in VALID_REPEAT),
        ("sentiment_score", lambda v: isinstance(v, (int, float)) and 1 <= v <= 5),
    ]
    for field, validator in checks:
        cat_total += 1
        if field in data and validator(data[field]):
            cat_checks += 1

    for bool_field in ("delivery_issue", "product_issue", "seller_issue", "would_recommend"):
        cat_total += 1
        if bool_field in data and isinstance(data[bool_field], bool):
            cat_checks += 1

    if cat_total > 0:
        r_task += 0.35 * (cat_checks / cat_total)

    if "main_complaint" in data and isinstance(data["main_complaint"], str):
        complaint = data["main_complaint"].strip()
        if len(complaint) > 10:
            r_task += 0.15

    return min(r_format + r_partial + r_task, 1.0)


def reward_sql_qa(completion: str) -> float:
    """Staged reward for SQL Q&A (max 1.0)."""
    answer = strip_think(completion)

    # ── Stage 1: FORMAT (max 0.2)
    r_format = 0.0
    if has_think_block(completion):
        r_format += 0.1
    if "```" in answer or re.search(r"SELECT|FROM", answer, re.IGNORECASE):
        r_format += 0.1

    # ── Stage 2: PARTIAL (max 0.3)
    r_partial = 0.0
    sql_keywords = r"SELECT|FROM|WHERE|GROUP BY|ORDER BY|COUNT|SUM|AVG|JOIN|HAVING"
    matches = len(re.findall(sql_keywords, answer, re.IGNORECASE))
    r_partial += min(0.15, 0.03 * matches)
    numbers = re.findall(r"\d+(?:[.,]\d+)?", answer)
    r_partial += min(0.15, 0.03 * len(numbers))

    # ── Stage 3: TASK (max 0.5)
    r_task = 0.0
    length = len(answer)
    if 50 <= length <= 600:
        r_task += 0.25
    elif length > 0:
        r_task += 0.25 * max(0, 1 - abs(length - 325) / 275)
    explanation_markers = ["para ", "porque", "resultado", "mostra", "indica", "análise"]
    expl_matches = sum(1 for w in explanation_markers if w in answer.lower())
    r_task += min(0.25, 0.05 * expl_matches)

    return min(r_format + r_partial + r_task, 1.0)


def reward_insights(completion: str) -> float:
    """Staged reward for insights (max 1.0)."""
    answer = strip_think(completion)

    # ── Stage 1: FORMAT (max 0.2)
    r_format = 0.0
    if has_think_block(completion):
        r_format += 0.1
    structure_marks = len(re.findall(r"^[-•*]\s|^\d+[.)]\s|^#{1,3}\s", answer, re.MULTILINE))
    r_format += min(0.1, 0.02 * structure_marks)

    # ── Stage 2: PARTIAL (max 0.3)
    r_partial = 0.0
    length = len(answer)
    if 100 <= length <= 1200:
        r_partial += 0.15
    elif length > 0:
        r_partial += 0.15 * max(0, 1 - abs(length - 650) / 550)
    pt_markers = re.findall(r"[ãçéêóúâõ]|você|para|como|seu|sua|cliente|produto", answer, re.IGNORECASE)
    r_partial += min(0.15, 0.01 * len(pt_markers))

    # ── Stage 3: TASK (max 0.5)
    r_task = 0.0
    action_words = ["recomend", "implement", "melhor", "reduzir", "aumentar",
                    "priorizar", "investir", "otimizar", "estratégi", "suger",
                    "consider", "ação", "plano"]
    matches = sum(1 for w in action_words if w in answer.lower())
    r_task += min(0.3, 0.06 * matches)
    data_refs = len(re.findall(r"\d+%|R\$\s*\d|média|percentual|comparad|taxa", answer, re.IGNORECASE))
    r_task += min(0.2, 0.04 * data_refs)

    return min(r_format + r_partial + r_task, 1.0)


def reward_push(completion: str) -> float:
    """Staged reward for push notifications (max 1.0)."""
    answer = strip_think(completion)
    if not answer:
        return 0.0

    # ── Stage 1: FORMAT (max 0.2)
    r_format = 0.0
    if has_think_block(completion):
        r_format += 0.05
    length = len(answer)
    if length <= 160:
        r_format += 0.15
    elif length <= 300:
        r_format += 0.1
    else:
        r_format += 0.05

    # ── Stage 2: PARTIAL (max 0.3)
    r_partial = 0.0
    pt_markers = re.findall(r"[ãçéêóúâõ]|você|para|como|seu|sua", answer, re.IGNORECASE)
    r_partial += min(0.15, 0.02 * len(pt_markers))
    if re.search(r"[!?]|[\U0001F600-\U0001F64F]|[\U0001F300-\U0001F5FF]", answer):
        r_partial += 0.05
    if len(answer.split()) >= 5:
        r_partial += 0.1

    # ── Stage 3: TASK (max 0.5)
    r_task = 0.0
    if length <= 120:
        r_task += 0.25
    else:
        r_task += 0.25 * max(0, 1 - (length - 120) / 120)
    generic_phrases = [
        "olá! como podemos ajudar", "obrigado pela sua compra",
        "seu pedido foi confirmado", "agradecemos sua preferência",
    ]
    max_similarity = max(_string_similarity(answer.lower(), g) for g in generic_phrases)
    r_task += 0.25 * (1 - max_similarity)

    return min(r_format + r_partial + r_task, 1.0)


def commerce_reward_fn(completions, prompts, **kwargs) -> list[float]:
    """
    Master reward function v3: dispatches by task type + zero-advantage noise.
    """
    rewards = []
    for completion, prompt in zip(completions, prompts):
        if isinstance(completion, list):
            comp_text = completion[-1]["content"] if completion else ""
        else:
            comp_text = str(completion)

        if isinstance(prompt, list):
            prompt_text = " ".join(m.get("content", "") for m in prompt)
        else:
            prompt_text = str(prompt)

        task = _classify_task_type(prompt_text)

        if task == "extraction":
            rewards.append(reward_extraction(comp_text))
        elif task == "sql_qa":
            rewards.append(reward_sql_qa(comp_text))
        elif task == "insights":
            rewards.append(reward_insights(comp_text))
        elif task == "push":
            rewards.append(reward_push(comp_text))
        else:
            r = 0.15 if has_think_block(comp_text) else 0.0
            r += 0.2 if comp_text.strip() else 0.0
            rewards.append(r)

    # ── v3: Zero-advantage noise injection ────────────────────────────────────
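    # NOTE: this grouping assumes rewards arrive in consecutive blocks of
    # NUM_GENERATIONS per prompt (TRL's GRPOTrainer repeats each prompt
    # G times in order, so each slice below is one group).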
    if ZERO_ADV_NOISE_STD > 0 and NUM_GENERATIONS > 1:
        for i in range(0, len(rewards), NUM_GENERATIONS):
            group = rewards[i:i+NUM_GENERATIONS]
            if len(group) < 2:
                continue
            if max(group) - min(group) < 0.001:
                for j in range(i, min(i+NUM_GENERATIONS, len(rewards))):
                    rewards[j] += random.gauss(0, ZERO_ADV_NOISE_STD)

    return rewards


print("✓ v3 Reward functions defined (staged: format → partial → task)")
```
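
A GPU-free sanity check of the staging behavior, using toy completions written for this test (not drawn from the dataset):

```python
# Toy completions (hypothetical) to confirm the staged scoring behaves:
# a well-formed extraction should outscore a format-only one.
good = ('<think>nota baixa, defeito</think>'
        '{"sentiment": "negative", "sentiment_score": 2, "churn_risk": "high", '
        '"delivery_issue": false, "product_issue": true, "seller_issue": true, '
        '"main_complaint": "produto veio com defeito", '
        '"complaint_category": "product_quality", "repeat_intent": "no", '
        '"would_recommend": false}')
bad = "<think>hmm</think>not json at all"
toy_prompt = "Extraia dados desta avaliação. Retorne um objeto JSON com estas chaves: ..."

r_good, r_bad = commerce_reward_fn([good, bad], [toy_prompt, toy_prompt])
print(f"good={r_good:.2f}  bad={r_bad:.2f}")
assert r_good > r_bad, "staged reward should separate full-task from format-only completions"
```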

---

## Cell 7: Reward Calibration

**Gate:** Verify reward variance > 0. Compare v3 scoring to v2 calibration (mean=0.38).

```python
train_path = DATA_DIR / "pairs" / "train.jsonl"

by_type = {"extraction": [], "sql_qa": [], "insights": [], "push": []}
with open(train_path) as f:
    for line in f:
        row = json.loads(line)
        convs = row["conversations"]
        prompt_msgs = [m for m in convs if m["role"] in ("system", "user")]
        if not prompt_msgs:
            continue
        user_text = " ".join(m["content"] for m in prompt_msgs if m["role"] == "user")
        task = _classify_task_type(user_text)
        by_type[task].append(prompt_msgs)

print(f"Prompts by type: {', '.join(f'{k}={len(v)}' for k, v in by_type.items())}")

rng = random.Random(42)
cal_samples = []
for task_type in ["extraction", "extraction", "sql_qa", "sql_qa", "insights", "insights", "push", "push"]:
    cal_samples.append(rng.choice(by_type[task_type]))

FastLanguageModel.for_inference(model)
print(f"\nReward calibration v3 ({len(cal_samples)} samples):")
print("-" * 70)

cal_rewards = []
for i, msgs in enumerate(cal_samples):
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True)
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    gen_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]

    r = commerce_reward_fn([response], [text])[0]
    cal_rewards.append(r)
    hit_ceiling = gen_tokens >= MAX_COMPLETION_LENGTH
    has_answer = "</think>" in response
    answer_preview = strip_think(response)[:100] if has_answer else "[stuck in <think>]"
    task = _classify_task_type(text)
    print(f"  [{task:12s}] reward={r:.2f} | tokens={gen_tokens:4d} | ceiling={'⚠️ HIT' if hit_ceiling else 'ok':6s} | {answer_preview}")

print(f"\nMean={sum(cal_rewards)/len(cal_rewards):.2f}, Min={min(cal_rewards):.2f}, Max={max(cal_rewards):.2f}")
print(f"v2 calibration was: Mean=0.38, Min=0.02, Max=0.70")
print(f"Variance > 0: {len(set(cal_rewards)) > 1}")
```
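
Optionally, turn the printed variance check into a hard gate (a sketch reusing `cal_rewards` from the cell above):

```python
# Hard gate: with zero reward variance, GRPO advantages are all zero and
# training would burn GPU hours with no gradient signal.
assert len(set(round(r, 3) for r in cal_rewards)) > 1, \
    "Zero reward variance — revisit the reward functions before training."
```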

---

## Cell 8: Dataset Preparation v3

```python
from datasets import Dataset

def prepare_grpo_datasets_v3(n_prompts=GRPO_PROMPTS, eval_ratio=EVAL_SPLIT_RATIO,
                             general_mix=GENERAL_MIX_RATIO, seed=42):
    rng = random.Random(seed)

    train_pools = {}
    eval_records = []
    for task, pool in by_type.items():
        shuffled = pool.copy()
        rng.shuffle(shuffled)
        n_eval = max(1, int(len(shuffled) * eval_ratio))
        eval_records.extend(shuffled[:n_eval])
        train_pools[task] = shuffled[n_eval:]

    if n_prompts is None:
        train_records = []
        for task, pool in train_pools.items():
            train_records.extend(pool)
        rng.shuffle(train_records)
    else:
        targets = {
            "extraction": int(n_prompts * 0.4),
            "sql_qa": int(n_prompts * 0.4),
            "insights": int(n_prompts * 0.1),
            "push": int(n_prompts * 0.1),
        }
        train_records = []
        for task, target_n in targets.items():
            pool = train_pools[task]
            n = min(target_n, len(pool))
            train_records.extend(rng.sample(pool, n))
        rng.shuffle(train_records)

    general_path = DATA_DIR / "pairs" / "general_reasoning.jsonl"
    if general_mix > 0 and general_path.exists():
        general_records = []
        with open(general_path) as f:
            for line in f:
                row = json.loads(line)
                convs = row["conversations"]
                prompt_msgs = [m for m in convs if m["role"] in ("system", "user")]
                if prompt_msgs:
                    general_records.append(prompt_msgs)
        n_general = int(len(train_records) * general_mix / (1 - general_mix))
        n_general = min(n_general, len(general_records))
        if n_general > 0:
            train_records.extend(rng.sample(general_records, n_general))
            rng.shuffle(train_records)
            print(f"  Cocktail Effect: added {n_general} general reasoning samples ({general_mix:.0%} mix)")
    elif general_mix > 0:
        print(f"  ⚠️ general_reasoning.jsonl not found — skipping mix")

    task_dist = {}
    for record in train_records:
        user_text = " ".join(m["content"] for m in record if m["role"] == "user")
        task = _classify_task_type(user_text)
        task_dist[task] = task_dist.get(task, 0) + 1

    n_domain = len(train_records)
    steps_per_epoch = n_domain * NUM_GENERATIONS // (BATCH_SIZE * GRAD_ACCUM)

    print(f"v3 Dataset split (eval_ratio={eval_ratio}):")
    print(f"  train : {n_domain} prompts")
    print(f"  eval  : {len(eval_records)} prompts")
    print(f"  distribution: {', '.join(f'{k}={v}' for k, v in sorted(task_dist.items()))}")
    print(f"  steps/epoch: {n_domain} × {NUM_GENERATIONS} / ({BATCH_SIZE} × {GRAD_ACCUM}) = {steps_per_epoch}")
    print(f"  MAX_STEPS={MAX_STEPS} → {'< 1 epoch' if MAX_STEPS < steps_per_epoch else f'{MAX_STEPS/steps_per_epoch:.1f} epochs'}")

    train_ds = Dataset.from_list([{"prompt": msgs} for msgs in train_records])
    eval_ds = Dataset.from_list([{"prompt": msgs} for msgs in eval_records])
    return train_ds, eval_ds


train_dataset, eval_dataset = prepare_grpo_datasets_v3()
dataset = train_dataset
print(f"\n✓ v3 Datasets ready: train={len(train_dataset)}, eval={len(eval_dataset)}")
```

---

## Cell 9: Smoke Test

**Gate:** Runs 1 step without OOM at the new completion length (4096).

```python
from trl import GRPOConfig, GRPOTrainer

FastLanguageModel.for_training(model)

smoke_config = GRPOConfig(
    output_dir=str(CHECKPOINT_DIR / "smoke"),
    num_generations=NUM_GENERATIONS,
    scale_rewards=SCALE_REWARDS,
    max_completion_length=MAX_COMPLETION_LENGTH,
    max_steps=1,
    num_train_epochs=1,
    temperature=TEMPERATURE,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=1,
    learning_rate=LEARNING_RATE,
    fp16=False,
    bf16=True,
    logging_steps=1,
    save_steps=999,
    report_to="none",
    max_prompt_length=MAX_SEQ_LENGTH - MAX_COMPLETION_LENGTH,
    seed=42,
    remove_unused_columns=False,
)

smoke_trainer = GRPOTrainer(
    model=model,
    reward_funcs=commerce_reward_fn,
    args=smoke_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # TRL 0.24: GRPOTrainer takes processing_class, not tokenizer
)

t0 = time.time()
smoke_trainer.train()
step_time = time.time() - t0

print(f"\n✓ Smoke test passed!")
print(f"  Step time (grad_accum=1): {step_time:.0f}s")
print(f"  Estimated step time (grad_accum={GRAD_ACCUM}): {step_time * GRAD_ACCUM:.0f}s")
print(f"  VRAM peak: {torch.cuda.max_memory_allocated()/1e9:.1f} GB / {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

vram_used = torch.cuda.max_memory_allocated() / 1e9
vram_total = torch.cuda.get_device_properties(0).total_memory / 1e9
if vram_used > vram_total * 0.95:
    print(f"\n⚠️ VRAM at {vram_used/vram_total:.0%} — dangerously close to OOM")
    print(f"  Option 1: Reduce MAX_COMPLETION_LENGTH to 3072")
    print(f"  Option 2: Reduce BATCH_SIZE to 2 (increase GRAD_ACCUM to 2)")

del smoke_trainer
gc.collect(); torch.cuda.empty_cache()
```

---

## Cell 10: Probe Run (3 steps)

```python
FastLanguageModel.for_training(model)

probe_config = GRPOConfig(
    output_dir=str(CHECKPOINT_DIR / "probe"),
    num_generations=NUM_GENERATIONS,
    scale_rewards=SCALE_REWARDS,
    max_completion_length=MAX_COMPLETION_LENGTH,
    max_steps=3,
    temperature=TEMPERATURE,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    learning_rate=LEARNING_RATE,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=True,
    logging_steps=1,
    disable_tqdm=True,
    logging_first_step=True,
    save_steps=999,
    report_to="none",
    max_prompt_length=MAX_SEQ_LENGTH - MAX_COMPLETION_LENGTH,
    seed=42,
    remove_unused_columns=False,
)

probe_trainer = GRPOTrainer(
    model=model,
    reward_funcs=commerce_reward_fn,
    args=probe_config,
    train_dataset=dataset,
    processing_class=tokenizer,  # TRL 0.24: processing_class (matches Cell 11)
)

t0 = time.time()
result = probe_trainer.train()
elapsed = time.time() - t0

print(f"\n✓ Probe complete in {elapsed:.0f}s ({elapsed/3:.0f}s/step)")
print(f"  Train loss: {result.training_loss:.6f}")
print(f"  Estimated full run ({MAX_STEPS} steps): {elapsed/3 * MAX_STEPS / 3600:.1f}h")

if abs(result.training_loss) < 1e-6:
    print("  ⚠️ Loss is near-zero — reward variance may be insufficient")
else:
    print("  ✓ Loss is non-zero — GRPO has gradient signal")

del probe_trainer
gc.collect(); torch.cuda.empty_cache()
```

---

## Cell 11: Full Training Run v3

```python
import wandb

_wandb_key = os.environ.get("WANDB_API_KEY", "").strip()
if not _wandb_key:
    raise EnvironmentError("WANDB_API_KEY is not set.")
wandb.login(key=_wandb_key, relogin=True)
print(f"✓ W&B authenticated")
```

```python
import shutil
import torch
from transformers import TrainerCallback
from trl import GRPOConfig, GRPOTrainer

wandb.init(
    project=WANDB_PROJECT,
    name=f"grpo-v3-l4-{time.strftime('%Y%m%d-%H%M')}",
    config={
        "model_id": MODEL_ID,
        "version": "v3",
        "temperature": TEMPERATURE,
        "max_completion_length": MAX_COMPLETION_LENGTH,
        "num_generations": NUM_GENERATIONS,
        "learning_rate": LEARNING_RATE,
        "beta": BETA,
        "batch_size": BATCH_SIZE,
        "grad_accum": GRAD_ACCUM,
        "max_steps": MAX_STEPS,
        "scale_rewards": SCALE_REWARDS,
        "save_steps": SAVE_STEPS,
        "eval_steps": EVAL_STEPS,
        "eval_max_samples": EVAL_MAX_SAMPLES,
        "eval_max_tokens": EVAL_MAX_TOKENS,
        "eval_temperature": EVAL_TEMPERATURE,
        "patience": EARLY_STOPPING_PATIENCE,
        "delta": EARLY_STOPPING_DELTA,
        "train_prompts": len(train_dataset),
        "eval_prompts": len(eval_dataset),
        "zero_adv_noise_std": ZERO_ADV_NOISE_STD,
        "general_mix_ratio": GENERAL_MIX_RATIO,
        "_ref_temperature": "Skywork-OR1 (2505.22312)",
        "_ref_completion_length": "Dr. GRPO (2503.20783)",
        "_ref_staged_rewards": "Reasoning-SQL (2503.23157)",
        "_ref_zero_adv": "Skywork-OR1 (2505.22312)",
    },
)
print(f"✓ W&B run: {wandb.run.url}")

FRESH = True
resume_from = None
if FRESH and CHECKPOINT_DIR.exists():
    print("FRESH: deleting old checkpoints...")
    shutil.rmtree(CHECKPOINT_DIR)
elif CHECKPOINT_DIR.exists():
    checkpoints = sorted(
        [d for d in CHECKPOINT_DIR.iterdir()
         if d.is_dir() and d.name.startswith("checkpoint-")],
        key=lambda d: int(d.name.split("-")[-1]),
    )
    if checkpoints:
        resume_from = str(checkpoints[-1])
        print(f"Resuming from: {resume_from}")


class UnslothGRPOTrainer(GRPOTrainer):
    """Wraps generation with Unsloth for_inference()/for_training()."""
    def _generate(self, prompts, images):
        FastLanguageModel.for_inference(self.model)
        try:
            result = super()._generate(prompts, images)
        finally:
            FastLanguageModel.for_training(self.model)
        return result


class EvalRewardCallback(TrainerCallback):
    """v3: deterministic eval, per-task breakdown, patience=15."""
    def __init__(self, eval_records, reward_fn, patience=EARLY_STOPPING_PATIENCE,
                 delta=EARLY_STOPPING_DELTA):
        self.eval_records = eval_records
        self.reward_fn = reward_fn
        self.patience = patience
        self.delta = delta
        self.best_reward = -float("inf")
        self.no_improve_count = 0

    def on_step_end(self, args, state, control, model=None, processing_class=None, **kwargs):
        if state.global_step == 0 or state.global_step % EVAL_STEPS != 0:
            return control
        tokenizer = processing_class
        if tokenizer is None:
            print("[EvalRewardCallback] WARNING: tokenizer is None, skipping eval")
            return control

        mean_reward, task_rewards = self._run_eval(model, tokenizer, args)
        improved = mean_reward > self.best_reward + self.delta
        status = "↑ improved" if improved else f"↔ no gain ({self.no_improve_count + 1}/{self.patience})"

        log_dict = {
            "eval/mean_reward": mean_reward,
            "eval/best_reward": max(self.best_reward, mean_reward),
            "eval/no_improve_count": self.no_improve_count,
        }
        for task, rewards in task_rewards.items():
            if rewards:
                log_dict[f"eval/{task}_reward"] = sum(rewards) / len(rewards)
        wandb.log(log_dict, step=state.global_step)

        print(f"\n[EvalReward] step={state.global_step} | mean={mean_reward:.4f} | best={self.best_reward:.4f} | {status}")
        for task, rewards in task_rewards.items():
            if rewards:
                print(f"  {task}: {sum(rewards)/len(rewards):.3f} (n={len(rewards)})")

        if improved:
            self.best_reward = mean_reward
            self.no_improve_count = 0
        else:
            self.no_improve_count += 1
            if self.no_improve_count >= self.patience:
                print(f"[EarlyStopping] No improvement ≥ {self.delta} for {self.patience} consecutive evals. Halting.")
                wandb.log({"early_stop/step": state.global_step}, step=state.global_step)
                control.should_training_stop = True
        return control

    def _run_eval(self, model, tokenizer, args):
        FastLanguageModel.for_inference(model)
        rewards = []
        task_rewards = {}
        subset = self.eval_records[:EVAL_MAX_SAMPLES]
        for record in subset:
            msgs = record["prompt"]
            text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
            inputs = tokenizer(text, return_tensors="pt", truncation=True,
                               max_length=args.max_prompt_length).to(model.device)
            with torch.no_grad():
                out = model.generate(**inputs, max_new_tokens=EVAL_MAX_TOKENS,
                                     temperature=EVAL_TEMPERATURE, do_sample=True)
            resp = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            r = self.reward_fn([resp], [text])[0]
            rewards.append(r)
            user_text = " ".join(m.get("content", "") for m in msgs if m.get("role") == "user")
            task = _classify_task_type(user_text)
            task_rewards.setdefault(task, []).append(r)
        FastLanguageModel.for_training(model)
        mean = sum(rewards) / len(rewards) if rewards else 0.0
        return mean, task_rewards


class EntropyMonitorCallback(TrainerCallback):
    """v3 NEW: Monitor entropy collapse indicators (Skywork-OR1 §4)."""
    def __init__(self):
        self.consecutive_ceiling_hits = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return
        step = state.global_step
        monitor = {}
        comp_len = logs.get("completion_length", 0)
        if comp_len > 0:
            ratio = comp_len / MAX_COMPLETION_LENGTH
            monitor["monitor/completion_ratio"] = ratio
            if ratio > 0.95:
                self.consecutive_ceiling_hits += 1
                if self.consecutive_ceiling_hits >= 3:
                    print(f"⚠️ Step {step}: Completion ceiling hit {self.consecutive_ceiling_hits} consecutive times.")
            else:
                self.consecutive_ceiling_hits = 0
        reward_std = logs.get("reward_std", logs.get("rewards/commerce_reward_fn/std", 0))
        if reward_std is not None:
            monitor["monitor/reward_std"] = reward_std
            if reward_std < 0.01:
                print(f"⚠️ Step {step}: reward_std={reward_std:.4f} — near-zero variance")
        clip_high = logs.get("clip_ratio/high_mean", 0)
        clip_low = logs.get("clip_ratio/low_mean", 0)
        if clip_high is not None and clip_low is not None:
            total_clip = clip_high + abs(clip_low)
            monitor["monitor/total_clip_ratio"] = total_clip
            if total_clip > 0.01 and step > 10:
                print(f"✓ Step {step}: clip_ratio={total_clip:.3f} — policy is updating")
        if monitor and wandb.run:
            wandb.log(monitor, step=step)


FastLanguageModel.for_training(model)

grpo_config = GRPOConfig(
    output_dir=str(CHECKPOINT_DIR),
    num_generations=NUM_GENERATIONS,
    scale_rewards=SCALE_REWARDS,
    max_completion_length=MAX_COMPLETION_LENGTH,
    temperature=TEMPERATURE,
    beta=BETA,  # v3: explicitly disable the KL penalty (Dr. GRPO) — was only implied before
    max_steps=MAX_STEPS,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    learning_rate=LEARNING_RATE,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    fp16=False,
    bf16=True,
    logging_steps=1,
    logging_first_step=True,
    disable_tqdm=True,
    save_steps=SAVE_STEPS,
    save_total_limit=SAVE_TOTAL_LIMIT,
    save_only_model=True,
    eval_steps=EVAL_STEPS,
    report_to="wandb",
    max_prompt_length=MAX_SEQ_LENGTH - MAX_COMPLETION_LENGTH,
    seed=42,
    remove_unused_columns=False,
    **({"use_vllm": True, "vllm_mode": "colocate",
        "vllm_enable_sleep_mode": True} if USE_VLLM else {}),
)

eval_cb = EvalRewardCallback(eval_records=list(eval_dataset), reward_fn=commerce_reward_fn)
entropy_cb = EntropyMonitorCallback()

TrainerClass = GRPOTrainer if USE_VLLM else UnslothGRPOTrainer
trainer = TrainerClass(
    model=model,
    reward_funcs=commerce_reward_fn,
    args=grpo_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    callbacks=[eval_cb, entropy_cb],
)

print(f"{'='*70}")
print(f"GRPO v3 Training — Ready to Launch")
print(f"{'='*70}")
print(f"  Trainer:       {TrainerClass.__name__}")
print(f"  Max steps:     {MAX_STEPS}")
print(f"  Temperature:   {TEMPERATURE} (v2 was 0.8)")
print(f"  Completion:    {MAX_COMPLETION_LENGTH} tokens (v2 was 2048)")
print(f"  Generations:   {NUM_GENERATIONS} per prompt (v2 was 8)")
print(f"  Learning rate: {LEARNING_RATE} (v2 was 5e-7)")
print(f"  Save every:    {SAVE_STEPS} steps (keep {SAVE_TOTAL_LIMIT})")
print(f"  Eval every:    {EVAL_STEPS} steps ({EVAL_MAX_SAMPLES} samples × {EVAL_MAX_TOKENS} tok)")
print(f"  Patience:      {EARLY_STOPPING_PATIENCE} evals ({EARLY_STOPPING_PATIENCE * EVAL_STEPS} steps)")
print(f"  Resume:        {resume_from is not None}")
print(f"{'='*70}")

t_start = time.time()
result = trainer.train(resume_from_checkpoint=resume_from)
elapsed = time.time() - t_start

wandb.log({
    "train/final_loss": result.training_loss,
    "train/duration_hours": elapsed / 3600,
    "train/total_steps": result.global_step,
    "eval/best_reward_final": eval_cb.best_reward,
})
wandb.finish()

print(f"\n{'='*70}")
print(f"GRPO v3 Training Complete")
print(f"  Loss:        {result.training_loss:.6f}")
print(f"  Steps:       {result.global_step}")
print(f"  Duration:    {elapsed/3600:.1f}h")
print(f"  Best eval R: {eval_cb.best_reward:.4f}")
print(f"  Trainer:     {TrainerClass.__name__}")
print(f"{'='*70}")
```

---

## Cell 12: Save Adapter

```python
GRPO_ADAPTER_DIR.mkdir(parents=True, exist_ok=True)
model.save_pretrained(str(GRPO_ADAPTER_DIR))
tokenizer.save_pretrained(str(GRPO_ADAPTER_DIR))

summary = {
    "model_id": MODEL_ID,
    "sft_adapter": str(SFT_ADAPTER_DIR),
    "method": "GRPO",
    "version": "v3",
    "train_loss": result.training_loss,
    "best_eval_reward": eval_cb.best_reward,
    "num_prompts": len(train_dataset),
    "num_generations": NUM_GENERATIONS,
    "scale_rewards": SCALE_REWARDS,
    "temperature": TEMPERATURE,
    "learning_rate": LEARNING_RATE,
    "beta": BETA,
    "max_completion_length": MAX_COMPLETION_LENGTH,
    "max_steps": MAX_STEPS,
    "actual_steps": result.global_step,
    "epochs": NUM_EPOCHS,
    "max_seq_length": MAX_SEQ_LENGTH,
    "duration_seconds": round(elapsed),
    "gpu": "L4",
    "platform": "vertex-ai-workbench",
    "v3_fixes": [
        "temperature=1.0 (Skywork-OR1)",
        "max_completion_length=4096 (Dr. GRPO)",
        "learning_rate=2e-6 (4x v2)",
        "beta=0.0 (Dr. GRPO)",
        "staged rewards (Reasoning-SQL)",
        "zero-advantage noise (Skywork-OR1)",
        "entropy monitoring callback",
    ],
}
with open(GRPO_ADAPTER_DIR / "training_summary.json", "w") as f:
    json.dump(summary, f, indent=2)

print(f"✓ Adapter saved to {GRPO_ADAPTER_DIR}")
print(f"  Files: {[f.name for f in GRPO_ADAPTER_DIR.iterdir() if f.is_file()]}")
```

---

## Cell 13: Validation

```python
FastLanguageModel.for_inference(model)

system_msg = {"role": "system", "content": SYSTEM_PT}

test_prompts = [
    {"role": "user", "content": (
        "Analise esta avaliação de e-commerce brasileiro e extraia dados estruturados.\n\n"
        "nota=2/5 | status=delivered\ntítulo: decepcionado\n"
        "texto: Produto veio com defeito e o vendedor não respondeu.\n\n"
        "Retorne um objeto JSON com exatamente estas chaves:\n"
        "sentiment, sentiment_score, churn_risk, delivery_issue, product_issue, "
        "seller_issue, main_complaint, complaint_category, repeat_intent, would_recommend"
    )},
    {"role": "user", "content": (
        "Analise esta avaliação de e-commerce brasileiro e extraia dados estruturados.\n\n"
        "nota=5/5 | status=delivered\ntítulo: adorei!\n"
        "texto: Entrega rápida e produto exatamente como descrito. Recomendo!\n\n"
        "Retorne um objeto JSON com exatamente estas chaves:\n"
        "sentiment, sentiment_score, churn_risk, delivery_issue, product_issue, "
        "seller_issue, main_complaint, complaint_category, repeat_intent, would_recommend"
    )},
    {"role": "user", "content": "Quais são as categorias de reclamação mais frequentes e como afetam a nota média?"},
    {"role": "user", "content": "Analise a retenção de clientes afetados por product_quality."},
    {"role": "user", "content": (
        "Perfil do cliente:\n- Estado: MG\n- Valor do pedido: R$150\n"
        "- Reclamação: produto com defeito\n- Nota: 1.0/5\n\n"
        "Este cliente deve receber uma notificação de reengajamento?"
    )},
    {"role": "user", "content": "Compare a satisfação de clientes em SP vs RJ."},
    {"role": "user", "content": (
        "Crie uma notificação push de reengajamento para um cliente em SP "
        "que reclamou de atraso na entrega. Nota: 2/5."
    )},
]

print("=" * 70)
print("GRPO v3 Validation")
print("=" * 70)

v3_rewards = []
for i, prompt in enumerate(test_prompts):
    messages = [system_msg, prompt]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.1, do_sample=True)
    gen_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    reward = commerce_reward_fn([response], [text])[0]
    v3_rewards.append(reward)
    answer = strip_think(response)
    task = _classify_task_type(prompt["content"])
    hit_ceiling = gen_tokens >= MAX_COMPLETION_LENGTH

    print(f"\n--- Sample {i+1} [{task}] (reward={reward:.2f}, tokens={gen_tokens}, ceiling={'HIT' if hit_ceiling else 'ok'}) ---")
    print(f"Prompt: {prompt['content'][:80]}...")
    print(f"Answer: {answer[:400]}")

print(f"\n{'='*70}")
print(f"v3 Validation Summary")
print(f"{'='*70}")
print(f"  Mean reward: {sum(v3_rewards)/len(v3_rewards):.3f}")
print(f"  Min: {min(v3_rewards):.3f}")
print(f"  Max: {max(v3_rewards):.3f}")
print()
print(f"  Comparison to baselines:")
print(f"    SFT calibration (Cell 7): mean=0.38")
print(f"    GRPO v2 validation:       mean=0.54")
print(f"    GRPO v3 validation:       mean={sum(v3_rewards)/len(v3_rewards):.3f}")
v3_vs_v2 = (sum(v3_rewards)/len(v3_rewards) - 0.54) / 0.54 * 100
print(f"  v3 vs v2: {v3_vs_v2:+.1f}%")
```