# V4.2 Handoff — Closing the 0.5B to Gold Standard

**Date:** 2026-04-27
**Context:** V4.1 achieved eval_best=0.645 (+35.5% over V4). The parser fix and constant LR
were the two decisive changes. SQL Q&A remains stagnant (+3.8%). The insights swing
(0.84→0.62) is eval noise, not regression. Goal: close the remaining gaps before declaring
the 0.5B scientifically complete and making the methodology case.

---

## What V4.1 Left Open (Ordered by Urgency)

| Gap | Evidence | Blocks |
|-----|----------|--------|
| Eval suite too small (n≈2 for insights/push) | Insights swings ±0.22 between consecutive evals | Every subsequent claim about per-task performance |
| SQL Q&A stagnant at +3.8% | Reward function doesn't validate SQL quality | Knowing whether this is a reward ceiling or a capacity limit |
| Only 40% of training data seen | train/epoch=0.404 at step 600 | Knowing the true performance ceiling |
| No reward function audit protocol | Parser bug persisted for 3 versions | The next parser-class bug catching you again |
| Single run, no confidence intervals | No error bars on any reported number | Credibility of the methodology case |

---
## V4.2 Changes — Exactly What to Implement

### Change 1: Expand Eval Suite to 50+ Samples (Do Before Training)

**What:** Build a static eval set with a minimum of 15 samples per task type, stratified and
held fixed across all future runs.

```python
EVAL_SAMPLES_PER_TASK = {
    "extraction": 20,
    "sql_qa": 15,
    "insights": 15,
    "push": 15,
}
# Total: 65 eval samples (was 15 mixed)
```

**How:** Sample from `data/pairs/eval.jsonl`, verify the task distribution, save as
`data/pairs/eval_v2_stratified.jsonl`. Never resample — the same 65 prompts must be
used across V4.2 seeds 42, 123, 456. A build sketch follows.
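A minimal sketch of the build step. It assumes each `eval.jsonl` record carries a `prompt`
message list and reuses the project's `_classify_task_type` helper from the sampler code in
Change 6 (both assumptions; adapt to the real schema):

```python
import json
import random

random.seed(0)  # fixed seed: the stratified set must never change between runs

records = [json.loads(line) for line in open("data/pairs/eval.jsonl")]
by_task = {t: [] for t in EVAL_SAMPLES_PER_TASK}
for r in records:
    user_txt = " ".join(m["content"] for m in r["prompt"] if m["role"] == "user")
    by_task[_classify_task_type(user_txt)].append(r)

eval_set = []
for task, n in EVAL_SAMPLES_PER_TASK.items():
    assert len(by_task[task]) >= n, f"not enough {task} samples ({len(by_task[task])} < {n})"
    eval_set.extend(random.sample(by_task[task], n))

with open("data/pairs/eval_v2_stratified.jsonl", "w") as f:
    for r in eval_set:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
print(f"Wrote {len(eval_set)} samples")  # expect 65
```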
**Why this is Change 1:** The insights regression (0.84→0.62) made it impossible to
distinguish learning from noise. At n≈2, standard error = ±0.22. At n=15, standard
error drops to ±0.06. Every other change is uninterpretable without this fixed.

**Report format:** `mean ± 1.96 × std/sqrt(n)` for a 95% CI on each task.
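A helper matching that format (a sketch; `scores` is a task's list of per-sample rewards
from the stratified eval):

```python
import math

def ci95(scores: list[float]) -> str:
    """Format per-task eval scores as `mean ± 1.96·std/sqrt(n)` (95% CI)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)
    return f"{mean:.3f} ± {half:.3f} (n={n})"

# e.g. ci95(per_task_scores["insights"]) → something like "0.780 ± 0.060 (n=15)"
```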
---

### Change 2: Reward Function Audit Before Training (30-Minute Protocol)

**What:** Before launching any training cell, generate 20 completions (5 per task),
manually score them 0-10, and compute Spearman ρ between the human scores and the reward function.

```python
# Cell 8 — Reward Audit (add between calibration and training)
from scipy.stats import spearmanr

AUDIT_PROMPTS_PER_TASK = 5
audit_human_scores = []  # filled manually by the person running the notebook
audit_auto_scores = []

# Generate completions at temp=0.1 (near-deterministic), print each one.
# The person assigns a 0-10 score and appends it to audit_human_scores.
# Then:
rho, p_value = spearmanr(audit_human_scores, audit_auto_scores)
print(f"Reward function calibration: ρ={rho:.2f} (p={p_value:.3f})")
assert rho > 0.7, f"Reward function miscalibrated (ρ={rho:.2f} < 0.70). Fix before training."
```

**Gate:** ρ > 0.70. If below, find the discrepancy (another parser bug, wrong field
weighting, wrong task classifier) before spending GPU hours.
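The generate-and-score loop itself could look like the sketch below, using plain
`transformers` generation. It assumes `model`, `tokenizer`, and an `audit_prompts` list of
chat-formatted prompts already exist in the notebook, and that `commerce_reward_fn` takes a
list of completions and returns a list of scores (TRL-style); adapt to the real signature.
Spearman is rank-based, so the 0-10 human scale needn't match the reward scale:

```python
import torch

for prompt in audit_prompts:  # 20 prompts, 5 per task
    inputs = tokenizer.apply_chat_template(
        prompt, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
    completion = tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
    print("=" * 60)
    print(completion)
    audit_auto_scores.append(commerce_reward_fn([completion])[0])  # automatic score
    audit_human_scores.append(float(input("Human score 0-10: ")))  # manual score
```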
**Why:** The V1-V4 parser bug would have been caught in 30 minutes with this protocol.
This is the single cheapest addition to the methodology.

---
### Change 3: SQL Reward Overhaul

**What:** Replace heuristic vocabulary matching with a validation-aware reward that
distinguishes "mentions SQL keywords" from "produces a correct analytical answer."

```python
import re

def reward_sql_qa_v2(completion: str) -> float:
    answer = strip_think(completion)  # project helper: drops any <think>...</think> block
    if not answer.strip():
        return 0.0

    score = 0.0

    # Tier 1 (0.30): SQL structure detected
    sql_keywords = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY",
                    "JOIN", "HAVING", "COUNT", "AVG", "SUM"]
    sql_found = sum(1 for kw in sql_keywords if kw in answer.upper())
    if sql_found >= 3:
        score += 0.30
    elif sql_found >= 1:
        score += 0.15

    # Tier 2 (0.25): answer has both a query and an explanation
    has_query = bool(re.search(r"```sql|SELECT.{5,}FROM", answer, re.IGNORECASE | re.DOTALL))
    has_answer = any(w in answer.lower() for w in ["resultado", "total", "média", "mostra", "portanto"])
    if has_query and has_answer:
        score += 0.25
    elif has_query or has_answer:
        score += 0.12

    # Tier 3 (0.25): numerical specificity
    numbers = re.findall(r"\d+(?:[.,]\d+)?(?:\s*%)?", answer)
    score += min(0.25, 0.05 * len(numbers))

    # Tier 4 (0.20): Portuguese business-domain coherence
    pt_domain = ["pedidos", "clientes", "vendedores", "produtos", "avaliação",
                 "entrega", "reclamação", "satisfação", "categoria", "período"]
    score += min(0.20, 0.04 * sum(1 for w in pt_domain if w in answer.lower()))

    return min(score, 1.0)
```

**Why:** The current SQL reward scores a completion that says "os pedidos de clientes em
2017 totalizaram 15.000" the same as one that says "SELECT COUNT(*) FROM orders WHERE
year=2017 → result: 15,000 pedidos". The old reward can't tell the difference, so the model
found the easier path (domain vocabulary + numbers) and stopped improving.

**Expected outcome:** Two possibilities after the training run:
- SQL reward improves +20%+ → the reward was the ceiling, and the model has capacity
- SQL reward still stagnates → confirmed 0.5B capacity limit for analytical reasoning

Both are valid scientific conclusions. The point is to distinguish them.

---
### Change 4: Extended Training (1,500 Steps)

**What:** `MAX_STEPS = 600 → 1500`, a 2.5× step budget over V4.1, with dataset shuffling.

```python
MAX_STEPS = 1500  # was 600
SAVE_STEPS = 100  # was 50
EVAL_STEPS = 50   # was 20; spaced to keep eval overhead manageable over 2.5x the steps
```

**Why:** V4.1 saw only 40% of the training data, and eval was still improving near the end
(best eval at step 500/600). There is signal remaining in the data. At V4.1's rate
(train/epoch=0.404 at step 600, i.e. one epoch ≈ 1,485 steps), 1,500 steps covers the full
dataset roughly once, so the realistic goal is full coverage rather than repeated passes.

**Risk monitoring:** Watch `eval/mean_reward` vs `train/reward` divergence. If eval
plateaus while train keeps rising past step 800, the model is overfitting. Stop training
and save the pre-overfitting checkpoint.

**Shuffling:** Ensure `shuffle=True` in the DataLoader, or use a different seed per
epoch for `Dataset.shuffle()`. Seeing the same prompts in the same order each epoch
eliminates the diversity benefit. A sketch follows.
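A minimal sketch of per-pass shuffling with 🤗 `datasets`, should the run extend past one
full pass (assumes `train_ds` is the prepared `Dataset` and `CURRENT_SEED` comes from Cell 3):

```python
from datasets import concatenate_datasets

# Two differently shuffled passes back to back; the trainer then consumes them
# in order, so no two passes present the prompts in the same sequence.
train_ds_2pass = concatenate_datasets([
    train_ds.shuffle(seed=CURRENT_SEED),
    train_ds.shuffle(seed=CURRENT_SEED + 1),
])
```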
**Estimated cost:** ~12 hours on L4. Run overnight.

---

### Change 5: GDPO Per-Reward Normalization

**What:** Normalize each reward component independently before summing, instead of summing
raw components and normalizing the batch total.

```python
import torch

# GDPO: normalize each reward component separately before summing.
# Instead of: reward = sum(extraction + sql + insights + push) → normalize batch,
# do: normalize each component independently, then sum.

def gdpo_normalize(component_rewards: dict[str, list[float]]) -> list[float]:
    """Per-component normalization before aggregation (GDPO 2601.05242 §3.1)."""
    normalized = {}
    for task, rewards in component_rewards.items():
        rewards_t = torch.tensor(rewards, dtype=torch.float32)
        std = rewards_t.std()
        if std > 1e-8:
            normalized[task] = ((rewards_t - rewards_t.mean()) / std).tolist()
        else:
            normalized[task] = [0.0] * len(rewards)  # zero-variance group carries no signal
    # Sum the normalized components per sample
    n = len(next(iter(normalized.values())))
    return [sum(normalized[t][i] for t in normalized) for i in range(n)]
```

This requires `commerce_reward_fn` to return per-component rewards (not just the sum), and
the trainer to call `gdpo_normalize` before computing advantages. Moderate effort: it
requires a custom trainer subclass.
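What "return per-component rewards" could look like, as a sketch. The per-task reward
function names other than `reward_sql_qa_v2` are assumed Cell 7 names; the decomposition
follows the `sum(extraction + sql + insights + push)` aggregation described above, where
every completion is scored by every component:

```python
REWARD_FNS = {
    "extraction": reward_extraction,  # assumed Cell 7 names
    "sql_qa": reward_sql_qa_v2,
    "insights": reward_insights,
    "push": reward_push,
}

def commerce_reward_components(completions: list[str]) -> dict[str, list[float]]:
    """Score every completion with every component reward, keyed the way
    gdpo_normalize expects (equal-length lists per component)."""
    return {task: [fn(c) for c in completions] for task, fn in REWARD_FNS.items()}
```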
---

### Change 6: Dynamic Task Weighting (MT-GRPO IWU)

**What:** Track per-task reward improvement rates every N eval steps and increase
sampling probability for stagnating tasks.

```python
import random

class DynamicTaskSampler:
    """
    MT-GRPO §3.2: Improvement-aware Weight Update (IWU).
    Upweights tasks with stagnating reward, downweights converging tasks.
    """
    def __init__(self, tasks, initial_weights=None, update_interval=50):
        self.tasks = tasks
        # Defaults to uniform; pass per-task weights (e.g. sql_qa=0.40 to mirror
        # the 40/40/10/10 data mix) so the starting point matches the data.
        self.weights = dict(initial_weights) if initial_weights else {t: 1.0 / len(tasks) for t in tasks}
        self.history = {t: [] for t in tasks}
        self.interval = update_interval

    def update(self, step: int, per_task_rewards: dict):
        if step % self.interval != 0 or step == 0:
            return
        for task, reward in per_task_rewards.items():
            self.history[task].append(reward)
            if len(self.history[task]) >= 2:
                improvement = self.history[task][-1] - self.history[task][-2]
                if improvement < 0.01:  # stagnating
                    self.weights[task] = min(0.60, self.weights[task] * 1.3)
                elif improvement > 0.05:  # improving fast
                    self.weights[task] = max(0.10, self.weights[task] * 0.85)
        # Normalize to sum to 1
        total = sum(self.weights.values())
        self.weights = {t: w / total for t, w in self.weights.items()}

    def sample_indices(self, dataset, n_samples: int) -> list[int]:
        """Sample indices with probability proportional to task weights."""
        task_indices = {t: [] for t in self.tasks}
        for i, record in enumerate(dataset):
            user_txt = " ".join(m["content"] for m in record["prompt"] if m["role"] == "user")
            task = _classify_task_type(user_txt)
            task_indices[task].append(i)

        sampled = []
        for task, weight in self.weights.items():
            n = max(1, int(n_samples * weight))
            pool = task_indices.get(task, [])
            if pool:
                sampled.extend(random.sample(pool, min(n, len(pool))))
        return sampled
```
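Instantiation matching the data mix, so `sampler/sql_weight` starts at the 0.40 referenced
below (the weight dict is illustrative):

```python
sampler = DynamicTaskSampler(
    tasks=["extraction", "sql_qa", "insights", "push"],
    initial_weights={"extraction": 0.40, "sql_qa": 0.40, "insights": 0.10, "push": 0.10},
    update_interval=50,  # matches EVAL_STEPS so updates ride on fresh eval numbers
)
```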
**Why:** SQL Q&A improved +3.8% vs insights +68% under equal weighting. MT-GRPO's
Theorem 1 proves GRPO allocates gradient budget to higher-variance tasks, so SQL gets
starved. IWU corrects this by progressively increasing SQL's sampling probability as
its improvement rate drops below the threshold.

**Log:** `wandb.log({"sampler/sql_weight": sampler.weights["sql_qa"], ...})` at every
update. You want to see the SQL weight increasing from 0.40 (initial) toward 0.60+ if the
model is still stagnating.

---
### Change 7: Three Seeds for Reproducibility

**What:** Run V4.2 with `seed ∈ {42, 123, 456}`. Report `mean ± std` across seeds
for all headline metrics.

```python
import random
import torch

# In Cell 3:
SEEDS = [42, 123, 456]
CURRENT_SEED = 42  # change per run

# In GRPOConfig and dataset prep:
seed = CURRENT_SEED
random.seed(seed)
torch.manual_seed(seed)
```

**Why:** Three seeds is the minimum for a credible ML result. A single lucky or unlucky
random initialization can produce ±0.05 variance in eval reward. With three seeds, you
can report `0.645 ± 0.02` instead of just `0.645` — the former is publishable, the
latter is an observation.

**Cost:** 3 × ~12h = ~36h on L4. Run them in parallel if you have access to multiple
GPUs, or sequentially over a few nights.

**Report format:**

| Task | Seed 42 | Seed 123 | Seed 456 | Mean ± 95% CI |
|---|---|---|---|---|
| Extraction | ... | ... | ... | X.XX ± 0.0X |
| SQL Q&A | ... | ... | ... | X.XX ± 0.0X |
| Insights | ... | ... | ... | X.XX ± 0.0X |
| Push | ... | ... | ... | X.XX ± 0.0X |
| **Mean** | ... | ... | ... | **X.XX ± 0.0X** |
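A sketch of the Cell 16 table generation, assuming each seed's per-task eval means were
saved into a `results[seed][task]` dict by Cell 14 (the storage layout is an assumption):

```python
import statistics

TASKS = ["extraction", "sql_qa", "insights", "push"]

def seed_row(task: str, results: dict) -> str:
    """One markdown table row: per-seed values plus mean ± 95% CI."""
    vals = [results[s][task] for s in SEEDS]
    mean = statistics.mean(vals)
    # With n=3 seeds this is a wide, honest interval
    half = 1.96 * statistics.stdev(vals) / len(vals) ** 0.5
    cells = " | ".join(f"{v:.3f}" for v in vals)
    return f"| {task} | {cells} | {mean:.2f} ± {half:.2f} |"

for task in TASKS:
    print(seed_row(task, results))
```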
---

### Change 8: Best Checkpoint Saving

The V4.1 config had `save_total_limit=5` and `save_only_model=True` but no explicit
best-checkpoint logic, and `GRPOTrainer` doesn't have a native `load_best_model_at_end`
equivalent the way `Trainer` does. The correct pattern is to track the best eval reward in
`EvalRewardCallback` and save explicitly when it improves:

```python
# In EvalRewardCallback.on_step_end, after computing mean_reward:
improved = mean_reward > self.best_reward
if improved:
    self.best_reward = mean_reward
    self.best_step = state.global_step
    self.no_improve_count = 0
    # Save the best checkpoint explicitly
    best_path = ADAPTER_DIR / "best_checkpoint"
    model.save_pretrained(str(best_path))
    tokenizer.save_pretrained(str(best_path))
    print(f"  ✓ Best checkpoint saved → {best_path} (reward={mean_reward:.4f})")
```

This guarantees the adapter saved in `best_checkpoint/` is always the peak eval reward,
regardless of where training ends or when early stopping fires. Add this to the
`EvalRewardCallback` block; the final Cell 15 should load from `best_checkpoint/` rather
than the last training step.
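Loading it in Cell 15 could look like the sketch below with plain `peft` (where
`base_model` stands for the raw Instruct model before adapter attachment — a naming
assumption; under Unsloth the loading path differs):

```python
from peft import PeftModel

# Attach the peak-reward adapter rather than whatever the last step left behind
best_model = PeftModel.from_pretrained(base_model, str(ADAPTER_DIR / "best_checkpoint"))
```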
---

## V4.2 Config Summary

```python
# Changes from V4.1
MAX_STEPS = 1500        # was 600
EVAL_STEPS = 50         # was 20
SAVE_STEPS = 100        # was 50
SEEDS = [42, 123, 456]  # run separately

# Structural additions (not hyperparameters)
eval_set = "data/pairs/eval_v2_stratified.jsonl"  # 65 fixed samples (Change 1)
reward_sql_qa = reward_sql_qa_v2                  # new SQL reward (Change 3)
reward_aggregation = gdpo_normalize               # per-component normalization (Change 5)
dynamic_sampler = DynamicTaskSampler(...)         # Change 6

# Everything else UNCHANGED from V4.1 (these are validated):
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
LEARNING_RATE = 5e-6
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
```
## What to Observe During V4.2

### The four questions V4.2 must answer

**Q1: Does SQL reward improve with the new reward function?**
Watch the `eval/sql_qa` trajectory. If it improves +15%+ over V4.1 by step 600, the old
reward was the ceiling. If it still stagnates at ~0.547, the 0.5B model has a genuine
capacity limit on analytical reasoning. Both are valid answers — the important thing is
knowing which.

**Q2: Is the insights regression noise or forgetting?**
With n=15 insights samples, the swing becomes ±0.06 instead of ±0.22. If insights
stabilizes at 0.75-0.85 throughout extended training, it was noise. If it consistently
drops after step 800, it's catastrophic forgetting — and you need MT-GRPO's constrained
optimization to address it.

**Q3: Does extended training push eval above 0.70?**
V4.1 was still improving at step 500 with eval_best=0.645. 1,500 steps of sustained
training with constant LR and full data coverage should push this further. Target: ≥0.70
mean across all tasks.

**Q4: Are results reproducible across seeds?**
If `mean ± std` across three seeds shows std < 0.03, the result is robust. If std > 0.05,
there's significant initialization sensitivity that needs to be understood before claiming
these numbers.

### WandB metrics to watch in real time

| Metric | Expected | Stop if |
|---|---|---|
| `eval/sql_qa` | Steeper upward slope vs V4.1 | Still flat at step 200 (reward still wrong) |
| `eval/insights` | Stable 0.75-0.85, no big swings | Drops consistently below 0.65 after step 800 |
| `eval/mean_reward` | Continues improving past step 500 | Plateaus at the same 0.645 as V4.1 |
| `sampler/sql_weight` | Rising from 0.40 toward 0.55-0.60 | Stays flat (IWU not triggering) |
| `train/reward` vs `eval/mean_reward` | Rising together | Train keeps rising past step 800 while eval plateaus (overfitting) |
---

## Drawing Conclusions for the Methodology Case

V4.2 closes the 0.5B story. After the runs complete, the project demonstrates:

### What was systematically learned across versions

| Version | Failure identified | Fix | Isolated contribution |
|---|---|---|---|
| V1 | 68% extraction task imbalance → task collapse | Rebalanced to 40/40/10/10 | Established balanced multi-task training |
| V2 | DPO near-no-op (9 gradient steps, flat loss) | Replaced DPO with GRPO | Established GRPO as the correct alignment method |
| V3 | Think model APO anchor + 2,628-token `<think>` overhead | Switched to Instruct model | Isolated architectural constraint |
| V4 | Cosine LR decay to zero by step 130 | Constant LR schedule | +370 productive training steps |
| V4 | JSON parser failed on PT-BR decimals | json-repair + PT normalizer | 3.25× measured extraction improvement |
| V4.1 | Stagnant SQL (+3.8%) | New reward function + dynamic weighting | To be measured in V4.2 |
| V4.1 | Single run, no CIs | 3-seed reproducibility | Statistical credibility |

### The scientific argument

Each version was not "trying something different" — it was **testing a specific hypothesis**
with a controlled change. The key methodological strength is causal attribution: when
eval_best went from 0.476 to 0.645, the V4.1 report decomposed the sources:
- ~0.13 from the parser fix (measured at step 20, before GRPO learning)
- ~0.13 from GRPO learning (measured from step 20 to peak)

This is the scientific standard: not just "it improved" but "here is why, with evidence."

V4.2 completes this arc by either showing SQL improvement (proving the reward was the
ceiling) or confirming a capacity limit (proving 0.5B has an irreducible floor on
analytical tasks). Either outcome is a clean, documentable finding.

### The external benchmark gap

One gap remains for true gold-standard credibility: no external benchmark comparison.
The current eval suite is project-internal. For the methodology case, consider running
the V4.2 final checkpoint against a subset of Portuguese NLP benchmarks from the Tucano2
model card (BLUEX, OAB, ENEM) on the specific knowledge domains your training covered.
Even partial coverage demonstrates the model didn't regress on general Portuguese ability
while improving on domain tasks. This is the "catastrophic forgetting" test at the
benchmark level, complementing the within-run insights stability test.

---
## Notebook Structure for V4.2

```
Cell 1:  Dependencies + env vars
Cell 2:  GPU + Unsloth + TRL verification gate
Cell 3:  Config constants (CURRENT_SEED variable)
Cell 4:  Load model + critical generation_config overrides
Cell 5:  Token ID verification gate
Cell 6:  KV cache diagnostic gate
Cell 7:  Reward functions v2 (including reward_sql_qa_v2)
Cell 8:  Reward function audit (30-min protocol, ρ > 0.70 gate)
Cell 9:  Build stratified eval set (65 samples, fixed)
Cell 10: Dataset preparation + DynamicTaskSampler init
Cell 11: Smoke test (1 step, VRAM check)
Cell 12: Probe run (10 steps, clip_ratio > 0 on 3/10 gate)
Cell 13: W&B init + full training (1,500 steps)
Cell 14: Post-training validation (stratified 65-sample eval)
Cell 15: Save adapter (loaded from best_checkpoint/, per Change 8)
Cell 16: Results table generation (for reporting)
```

**Run three times, changing only `CURRENT_SEED` in Cell 3.**

---

*V4.2 is the last 0.5B run. Its purpose is not to find more improvement —
it is to know exactly what was found and why, with enough statistical rigor
to say so in writing.*