# V4.2 Handoff — Closing the 0.5B Gap to Gold Standard


**Date:** 2026-04-27
**Context:** V4.1 achieved eval_best=0.645 (+35.5% over V4). Parser fix + constant LR
were the two decisive changes. SQL Q&A remains stagnant (+3.8%). Insights swing
(0.84→0.62) is eval noise, not regression. Goal: close the remaining gaps before declaring
the 0.5B scientifically complete and making the methodology case.

---

## What V4.1 Left Open (Ordered by Urgency)

| Gap | Evidence | Blocks |
|-----|----------|--------|
| Eval suite too small (n≈2 for insights/push) | Insights swings ±0.22 between consecutive evals | Every subsequent claim about per-task performance |
| SQL Q&A stagnant at +3.8% | Reward function doesn't validate SQL quality | Knowing whether this is a reward ceiling or a capacity limit |
| Only 40% of training data seen | train/epoch=0.404 at step 600 | Knowing the true performance ceiling |
| No reward function audit protocol | Parser bug persisted 3 versions | Catching the next parser-class bug before it costs GPU hours |
| Single run, no confidence intervals | No error bars on any reported number | Credibility of the methodology case |

---

## V4.2 Changes — Exactly What to Implement

### Change 1: Expand Eval Suite to 50+ Samples (Do Before Training)

**What:** Build a static eval set with minimum 15 samples per task type, stratified and
held fixed across all future runs.

```python
EVAL_SAMPLES_PER_TASK = {
    "extraction": 20,
    "sql_qa": 15,
    "insights": 15,
    "push": 15,
}
# Total: 65 eval samples (was 15 mixed)
```

**How:** Sample from `data/pairs/eval.jsonl`, verify task distribution, save as
`data/pairs/eval_v2_stratified.jsonl`. Never resample — the same 65 prompts must be
used across V4.2 seeds 42, 123, 456.
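
One way to build it (a minimal sketch; it assumes the eval records carry the same `prompt`
message list used for training and that the notebook's existing `_classify_task_type`
helper is in scope):

```python
import json
import random
from pathlib import Path

random.seed(42)  # build the eval set exactly once, then never resample

records = [json.loads(line) for line in
           Path("data/pairs/eval.jsonl").read_text().splitlines() if line.strip()]

by_task = {}
for rec in records:
    user_txt = " ".join(m["content"] for m in rec["prompt"] if m["role"] == "user")
    by_task.setdefault(_classify_task_type(user_txt), []).append(rec)

stratified = []
for task, n in EVAL_SAMPLES_PER_TASK.items():
    pool = by_task.get(task, [])
    assert len(pool) >= n, f"Only {len(pool)} {task} samples available, need {n}"
    stratified.extend(random.sample(pool, n))

with open("data/pairs/eval_v2_stratified.jsonl", "w") as f:
    for rec in stratified:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```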


**Why this is Change 1:** The insights regression (0.84→0.62) made it impossible to
distinguish learning from noise. At n≈2, standard error = ±0.22. At n=15, standard
error drops to ±0.06. Every other change is uninterpretable without this fixed.


**Report format:** `mean ± 1.96 × std/sqrt(n)` for 95% CI on each task.
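
A small helper for that line (sketch; `scores` is the list of per-sample eval rewards for one task):

```python
import math

def ci95(scores: list[float]) -> tuple[float, float]:
    """Return (mean, half-width of the 95% CI) for one task's eval scores."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean, 1.96 * math.sqrt(var / n)
```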


---

### Change 2: Reward Function Audit Before Training (30-minute Protocol)


**What:** Before launching any training cell, generate 20 completions (5 per task),
manually score them 0-10, and compute Spearman ρ between the human scores and the
reward function's scores.


```python
# Cell X — Reward Audit (add between calibration and training)
from scipy.stats import spearmanr

AUDIT_PROMPTS_PER_TASK = 5
audit_human_scores = []  # filled manually by the person running the notebook
audit_auto_scores = []   # filled by running the reward function on the same completions

# Generate completions at temp=0.1 (near-deterministic) and print each one.
# The person assigns a 0-10 score and appends it to audit_human_scores;
# audit_auto_scores gets the reward function's score for the same completion.
# Then:
rho, p_value = spearmanr(audit_human_scores, audit_auto_scores)
print(f"Reward function calibration: ρ={rho:.2f} (p={p_value:.3f})")
assert rho > 0.7, f"Reward function miscalibrated (ρ={rho:.2f} < 0.70). Fix before training."
```


**Gate:** ρ > 0.70. If below, find the discrepancy (another parser bug, wrong field
weighting, wrong task classifier) before spending GPU hours.


**Why:** The V1-V4 parser bug would have been caught in 30 minutes with this protocol.
This is the single cheapest addition to the methodology.


---


### Change 3: SQL Reward Overhaul


**What:** Replace heuristic vocabulary matching with a validation-aware reward that
distinguishes "mentions SQL keywords" from "produces a correct analytical answer."


```python
import re

def reward_sql_qa_v2(completion: str) -> float:
    # strip_think() is the existing notebook helper that removes the <think> block
    answer = strip_think(completion)
    if not answer.strip():
        return 0.0

    score = 0.0

    # Tier 1 (0.30): SQL structure detected
    sql_keywords = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "JOIN", "HAVING", "COUNT", "AVG", "SUM"]
    sql_found = sum(1 for kw in sql_keywords if kw in answer.upper())
    if sql_found >= 3:
        score += 0.30
    elif sql_found >= 1:
        score += 0.15

    # Tier 2 (0.25): Answer has both query and explanation
    has_query = bool(re.search(r"```sql|SELECT.{5,}FROM", answer, re.IGNORECASE | re.DOTALL))
    has_answer = any(w in answer.lower() for w in ["resultado", "total", "média", "mostra", "portanto"])
    if has_query and has_answer:
        score += 0.25
    elif has_query or has_answer:
        score += 0.12

    # Tier 3 (0.25): Numerical specificity
    numbers = re.findall(r"\d+(?:[.,]\d+)?(?:\s*%)?", answer)
    score += min(0.25, 0.05 * len(numbers))

    # Tier 4 (0.20): Portuguese business domain coherence
    pt_domain = ["pedidos", "clientes", "vendedores", "produtos", "avaliação",
                 "entrega", "reclamação", "satisfação", "categoria", "período"]
    score += min(0.20, 0.04 * sum(1 for w in pt_domain if w in answer.lower()))

    return min(score, 1.0)
```

**Why:** The current SQL reward scores a completion that says "os pedidos de clientes em
2017 totalizaram 15.000" ("customer orders in 2017 totaled 15,000") the same as one that
says "SELECT COUNT(*) FROM orders WHERE year=2017 → result: 15,000 pedidos". It cannot
tell the difference, so the model found the easier path (domain vocabulary + numbers) and
stopped improving.
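
A quick sanity check on exactly that pair (sketch; it assumes `strip_think` passes text
through unchanged when there is no `<think>` block):

```python
vague = "Os pedidos de clientes em 2017 totalizaram 15.000."
explicit = ("SELECT COUNT(*) AS total FROM orders WHERE order_year = 2017 "
            "-- Resultado: 15.000 pedidos no período.")

print(reward_sql_qa_v2(vague))     # ≈ 0.30: partial credit only (domain words + numbers)
print(reward_sql_qa_v2(explicit))  # ≈ 0.73: SQL structure + query-plus-answer + numbers + domain terms
```

The old reward scored both of these the same; the tiered version should separate them by roughly 0.4.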

**Expected outcome:** Two possibilities after the training run:
- SQL reward improves +20%+ → reward was the ceiling, model has capacity
- SQL reward still stagnates → confirmed 0.5B capacity limit for analytical reasoning

Both are valid scientific conclusions. The point is to distinguish them.

---

### Change 4: Extended Training to Full Data Coverage (1,500 Steps)

**What:** `MAX_STEPS = 600 → 1500`, training 2.5× longer than V4.1, with dataset shuffling.

```python
MAX_STEPS = 1500   # was 600
SAVE_STEPS = 100   # was 50
EVAL_STEPS = 50    # was 20; evaluate less often over the longer run
```

**Why:** V4.1 saw only 40% of the training data (train/epoch=0.404 at step 600 implies
~1,485 steps per full epoch). The eval was still improving at step 500 (best eval at step
500/600). There is signal remaining in the data: 1,500 steps covers the full dataset
roughly once instead of 40% of it.

**Risk monitoring:** Watch `eval/mean_reward` vs `train/reward` divergence. If eval
plateaus while train keeps rising past step 800, the model is overfitting. Stop training
and save the pre-overfitting checkpoint.

**Shuffling:** Ensure `shuffle=True` in the DataLoader, or use a different seed per
pass for `Dataset.shuffle()`. Seeing the same prompts in the same order on every pass
eliminates the diversity benefit.
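
A minimal sketch of the per-pass reshuffle (assuming the training data is a
`datasets.Dataset`; `shuffle(seed=...)` returns a reordered copy):

```python
from datasets import Dataset

def shuffled_for_pass(ds: Dataset, base_seed: int, pass_idx: int) -> Dataset:
    """Use a different ordering on each pass over the data."""
    return ds.shuffle(seed=base_seed + pass_idx)
```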

**Estimated cost:** ~12 hours on L4. Run overnight.

---

### Change 5: GDPO Per-Reward Normalization

**What:** Normalize each reward component within the batch separately, then sum, so that
no single component's scale or variance dominates the advantage signal.

```python
import torch

# GDPO: normalize each reward component separately before summing
# Instead of: reward = sum(extraction + sql + insights + push) → normalize batch
# Do: normalize each component independently, then sum

def gdpo_normalize(component_rewards: dict[str, list[float]]) -> list[float]:
    """Per-component normalization before aggregation (GDPO 2601.05242 §3.1)."""
    normalized = {}
    for task, rewards in component_rewards.items():
        rewards_t = torch.tensor(rewards, dtype=torch.float32)
        std = rewards_t.std()
        if std > 1e-8:
            normalized[task] = ((rewards_t - rewards_t.mean()) / std).tolist()
        else:
            normalized[task] = [0.0] * len(rewards)  # zero-variance group
    # Sum normalized components per sample
    n = len(next(iter(normalized.values())))
    return [sum(normalized[t][i] for t in normalized) for i in range(n)]
```

This requires `commerce_reward_fn` to return per-component rewards (not just the sum), and
the trainer to call `gdpo_normalize` before computing advantages. Moderate effort — requires
a custom trainer subclass.
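
A quick mechanical check of the function (hypothetical numbers for one batch of four
completions; each key is one reward component):

```python
batch_components = {
    "extraction": [0.9, 0.4, 0.7, 0.2],
    "sql_qa":     [0.3, 0.3, 0.6, 0.1],
}
combined = gdpo_normalize(batch_components)
print(combined)  # four scalars, one per completion, each the sum of z-scored components
```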


---


### Change 6: Dynamic Task Weighting (MT-GRPO IWU)


**What:** Track per-task reward improvement rates every N eval steps and increase
sampling probability for stagnating tasks.


```python
import random

# _classify_task_type() is the task classifier already defined in the reward-function cell

class DynamicTaskSampler:
    """
    MT-GRPO §3.2: Improvement-aware Weight Update.
    Upweights tasks with stagnating reward, downweights converging tasks.
    """
    def __init__(self, tasks, initial_weights=None, update_interval=50):
        self.tasks = tasks
        # Default to uniform weights; pass a dict (e.g. the 40/40/10/10 data mix)
        # to start from the training distribution instead.
        self.weights = dict(initial_weights) if initial_weights else {t: 1.0 / len(tasks) for t in tasks}
        self.history = {t: [] for t in tasks}
        self.interval = update_interval

    def update(self, step: int, per_task_rewards: dict):
        if step % self.interval != 0 or step == 0:
            return
        for task, reward in per_task_rewards.items():
            self.history[task].append(reward)
            if len(self.history[task]) >= 2:
                improvement = self.history[task][-1] - self.history[task][-2]
                if improvement < 0.01:  # stagnating
                    self.weights[task] = min(0.60, self.weights[task] * 1.3)
                elif improvement > 0.05:  # improving fast
                    self.weights[task] = max(0.10, self.weights[task] * 0.85)
        # Normalize to sum to 1
        total = sum(self.weights.values())
        self.weights = {t: w / total for t, w in self.weights.items()}

    def sample_indices(self, dataset, n_samples: int) -> list[int]:
        """Sample indices with probability proportional to task weights."""
        task_indices = {t: [] for t in self.tasks}
        for i, record in enumerate(dataset):
            user_txt = " ".join(m["content"] for m in record["prompt"] if m["role"] == "user")
            task = _classify_task_type(user_txt)
            task_indices[task].append(i)

        sampled = []
        for task, weight in self.weights.items():
            n = max(1, int(n_samples * weight))
            pool = task_indices.get(task, [])
            if pool:
                sampled.extend(random.sample(pool, min(n, len(pool))))
        return sampled
```


**Why:** SQL Q&A improved +3.8% vs insights +68% under equal weighting. MT-GRPO's
Theorem 1 proves GRPO allocates gradient budget to higher-variance tasks. SQL gets
starved. IWU corrects this by progressively increasing SQL's sampling probability as
its improvement rate drops below the threshold.


**Log:** `wandb.log({"sampler/sql_weight": sampler.weights["sql_qa"], ...})` at every
update. You want to see SQL weight increasing from 0.40 (its initial value when the
sampler is seeded from the 40/40/10/10 data mix) toward 0.60+ if the model is still
stagnating.
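
A usage sketch (the hook points are illustrative; `per_task_rewards` is whatever dict the
existing eval callback already computes):

```python
TASKS = ["extraction", "sql_qa", "insights", "push"]
sampler = DynamicTaskSampler(TASKS, initial_weights={"extraction": 0.40, "sql_qa": 0.40,
                                                     "insights": 0.10, "push": 0.10})

# inside the eval callback, after per-task eval rewards are computed at `step`:
sampler.update(step, per_task_rewards)  # e.g. {"sql_qa": 0.55, "extraction": 0.82, ...}
wandb.log({f"sampler/{t}_weight": w for t, w in sampler.weights.items()}, step=step)
```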


---


### Change 7: Three Seeds for Reproducibility


**What:** Run V4.2 with `seed ∈ {42, 123, 456}`. Report `mean ± std` across seeds
for all headline metrics.


```python
# In Cell 3:
SEEDS = [42, 123, 456]
CURRENT_SEED = 42  # change per run

# In GRPOConfig and Dataset prep:
seed = CURRENT_SEED
random.seed(seed)
torch.manual_seed(seed)
```


**Why:** Three seeds is the minimum for a credible ML result. A single lucky or unlucky
random initialization can produce ±0.05 variance in eval reward. With three seeds, you
can report `0.645 ± 0.02` instead of just `0.645` — the former is publishable, the
latter is an observation.


**Cost:** 3 × ~12h = ~36h on L4. Do these in parallel if you have access to multiple
GPUs, or sequentially overnight.

**Report format:**
```
| Task | Seed 42 | Seed 123 | Seed 456 | Mean ± 95% CI |
|---|---|---|---|---|
| Extraction | ... | ... | ... | X.XX ± 0.0X |
| SQL Q&A | ... | ... | ... | X.XX ± 0.0X |
| Insights | ... | ... | ... | X.XX ± 0.0X |
| Push | ... | ... | ... | X.XX ± 0.0X |
| **Mean** | ... | ... | ... | **X.XX ± 0.0X** |
```
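
A sketch of the aggregation that fills that table (assuming each seed's per-task eval means
have already been collected; the CI follows the report format defined in Change 1):

```python
import statistics

def seed_report_rows(per_task_by_seed: dict[str, list[float]]) -> list[str]:
    """per_task_by_seed[task] = [mean_seed42, mean_seed123, mean_seed456]."""
    rows = []
    for task, vals in per_task_by_seed.items():
        mean = statistics.mean(vals)
        half = 1.96 * statistics.stdev(vals) / len(vals) ** 0.5  # 95% CI across seeds
        cells = " | ".join(f"{v:.2f}" for v in vals)
        rows.append(f"| {task} | {cells} | {mean:.2f} ± {half:.2f} |")
    return rows
```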


---


### Change 8: Best Checkpoint Saving


The V4.1 config had `save_total_limit=5` and `save_only_model=True` but no explicit best-checkpoint
logic. `Trainer`'s `load_best_model_at_end` mechanism doesn't apply here, because eval reward is
computed inside the custom `EvalRewardCallback` rather than through `Trainer.evaluate()`, so there is
no `metric_for_best_model` for the trainer to select on. The workable pattern is to track the best
eval reward in `EvalRewardCallback` and save explicitly when it improves:


```python
# In EvalRewardCallback.on_step_end, after computing mean_reward:
# (model and tokenizer are assumed reachable here, e.g. stored on the callback at
#  construction time or taken from the callback's kwargs)
if improved:
    self.best_reward = mean_reward
    self.best_step = state.global_step
    self.no_improve_count = 0
    # Save best checkpoint explicitly
    best_path = ADAPTER_DIR / "best_checkpoint"
    model.save_pretrained(str(best_path))
    tokenizer.save_pretrained(str(best_path))
    print(f" ✓ Best checkpoint saved → {best_path} (reward={mean_reward:.4f})")
```


This guarantees that the adapter saved in `best_checkpoint/` is always the one with the peak eval
reward, regardless of where training ends or when early stopping fires. Add this to the
`EvalRewardCallback` block, and the final Cell 15 should load from `best_checkpoint/` rather than
the last training step.


---

## V4.2 Config Summary


```python
# Changes from V4.1
MAX_STEPS = 1500                 # was 600
EVAL_STEPS = 50                  # was 20
SAVE_STEPS = 100                 # was 50
SEEDS = [42, 123, 456]           # run separately

# Structural additions (not hyperparameters)
eval_set = "data/pairs/eval_v2_stratified.jsonl"   # 65 fixed samples
reward_sql_qa = reward_sql_qa_v2                   # new SQL reward (Change 3)
# per-component GDPO normalization via gdpo_normalize() in the trainer subclass (Change 5)
dynamic_sampler = DynamicTaskSampler(...)          # Change 6

# Everything else UNCHANGED from V4.1 (these are validated):
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
LEARNING_RATE = 5e-6
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
```


---


## What to Observe During V4.2


### The four questions V4.2 must answer


**Q1: Does SQL reward improve with the new reward function?**
Watch the `eval/sql_qa` trajectory. If it improves +15%+ over V4.1 by step 600, the old
reward was the ceiling. If it still stagnates at ~0.547, the 0.5B model has a genuine
capacity limit on analytical reasoning. Both are valid answers — the important thing is
knowing which.


**Q2: Is the insights regression noise or forgetting?**
With n=15 insights samples, the swing becomes ±0.06 instead of ±0.22. If insights
stabilizes at 0.75-0.85 throughout the extended run, it was noise. If it consistently
drops after step 800, it's catastrophic forgetting — and you need MT-GRPO's constrained
optimization to address it.


**Q3: Does extended training push eval above 0.70?**
V4.1 was still improving at step 500 with eval_best=0.645. 1,500 steps of sustained
training with constant LR and full data coverage should push this further. Target: ≥0.70
mean across all tasks.

**Q4: Are results reproducible across seeds?**
If `mean ± std` across three seeds shows std < 0.03, the result is robust. If std > 0.05,
there's significant initialization sensitivity that needs to be understood before claiming
these numbers.

### WandB metrics to watch in real time

| Metric | Expected | Stop if |
|---|---|---|
| `eval/sql_qa` | Steeper upward slope vs V4.1 | Still flat at step 200 (reward still wrong) |
| `eval/insights` | Stable 0.75-0.85, no big swings | Drops consistently below 0.65 after step 800 |
| `eval/mean_reward` | Continues improving past step 500 | Plateaus at the same 0.645 as V4.1 |
| `sampler/sql_weight` | Rising from 0.40 toward 0.55-0.60 | Stays flat (IWU not triggering) |
| `train/reward` vs `eval/mean_reward` | Both rising together | Train keeps rising while eval plateaus (overfitting) |


## Drawing Conclusions for the Methodology Case


V4.2 closes the 0.5B story. After the runs complete, the project demonstrates:


### What was systematically learned across versions


| Version | Failure identified | Fix | Isolated contribution |
|---|---|---|---|
| V1 | 68% extraction task imbalance → task collapse | Rebalanced to 40/40/10/10 | Established balanced multi-task training |
| V2 | DPO near-no-op (9 gradient steps, flat loss) | Replaced DPO with GRPO | Established GRPO as the correct alignment method |
| V3 | Think model APO anchor + 2,628-token `<think>` overhead | Switched to Instruct model | Isolated architectural constraint |
| V4 | Cosine LR decay to zero by step 130 | Constant LR schedule | +370 productive training steps |
| V4 | JSON parser failed on PT-BR decimals | json-repair + PT normalizer | 3.25× measured extraction improvement |
| V4.1 | Stagnant SQL (+3.8%) | New reward function + dynamic weighting | To be measured in V4.2 |
| V4.1 | Single run, no CIs | 3-seed reproducibility | Statistical credibility |


### The scientific argument


Each version was not "trying something different" — it was **testing a specific hypothesis**
with a controlled change. The key methodological strength is causal attribution: when
eval_best went from 0.476 to 0.645, the V4.1 report decomposed the sources:
- ~0.13 from parser fix (measured at step 20, before GRPO learning)
- ~0.13 from GRPO learning (measured from step 20 to peak)

This is the scientific standard: not just "it improved" but "here is why, with evidence."

V4.2 completes this arc by either showing SQL improvement (proving reward was the ceiling)
or confirming capacity limit (proving 0.5B has an irreducible floor on analytical tasks).
Either outcome is a clean, documentable finding.

### The external benchmark gap

One gap remains for true gold-standard credibility: no external benchmark comparison.
The current eval suite is project-internal. For the methodology case, consider running
the V4.2 final checkpoint against a subset of Portuguese NLP benchmarks from the Tucano2
model card (BLUEX, OAB, ENEM) on the specific knowledge domains your training covered.
Even partial coverage demonstrates the model didn't regress on general Portuguese ability
while improving on domain tasks. This is the "catastrophic forgetting" test at the
benchmark level, complementing the within-run insights stability test.

---

## Notebook Structure for V4.2

```
Cell 1: Dependencies + env vars
Cell 2: GPU + Unsloth + TRL verification gate
Cell 3: Config constants (CURRENT_SEED variable)
Cell 4: Load model + critical generation_config overrides
Cell 5: Token ID verification gate
Cell 6: KV cache diagnostic gate
Cell 7: Reward functions v2 (including reward_sql_qa_v2)
Cell 8: Reward function audit (30-min protocol, ρ > 0.70 gate)
Cell 9: Build stratified eval set (65 samples, fixed)
Cell 10: Dataset preparation + DynamicTaskSampler init
Cell 11: Smoke test (1 step, VRAM check)
Cell 12: Probe run (10 steps, clip_ratio > 0 on 3/10 gate)
Cell 13: W&B init + full training (1,500 steps)
Cell 14: Post-training validation (stratified 65-sample eval)
Cell 15: Save adapter
Cell 16: Results table generation (for reporting)
```

**Run three times, changing only `CURRENT_SEED` in Cell 3.**

---

*V4.2 is the last 0.5B run. Its purpose is not to find more improvement —
it is to know exactly what was found and why, with enough statistical rigor
to say so in writing.*