
V4.2 Handoff — Closing the 0.5B Gap to the Gold Standard

Date: 2026-04-27
Context: V4.1 achieved eval_best=0.645 (+35.5% over V4). Parser fix + constant LR were the two decisive changes. SQL Q&A remains stagnant (+3.8%). Insights swing (0.84→0.62) is eval noise, not regression. Goal: close remaining gaps before declaring the 0.5B scientifically complete and making the methodology case.


What V4.1 Left Open (Ordered by Urgency)

| Gap | Evidence | What it blocks |
|---|---|---|
| Eval suite too small (n≈2 for insights/push) | Insights swings ±0.22 between consecutive evals | Every subsequent claim about per-task performance |
| SQL Q&A stagnant at +3.8% | Reward function doesn't validate SQL quality | Knowing whether this is a reward ceiling or a capacity limit |
| Only 40% of training data seen | train/epoch=0.404 at step 600 | Knowing the true performance ceiling |
| No reward function audit protocol | Parser bug persisted 3 versions | The next parser-class bug catching you again |
| Single run, no confidence intervals | No error bars on any reported number | Credibility of the methodology case |

V4.2 Changes — Exactly What to Implement

Change 1: Expand Eval Suite to 50+ Samples (Do Before Training)

What: Build a static eval set with minimum 15 samples per task type, stratified and held fixed across all future runs.

EVAL_SAMPLES_PER_TASK = {
    "extraction": 20,
    "sql_qa":     15,
    "insights":   15,
    "push":       15,
}
# Total: 65 eval samples (was 15 mixed)

How: Sample from data/pairs/eval.jsonl, verify task distribution, save as data/pairs/eval_v2_stratified.jsonl. Never resample — the same 65 prompts must be used across V4.2 seeds 42, 123, 456.
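A minimal sketch of that build step (it reuses the project's _classify_task_type helper and assumes eval.jsonl records use the prompt-message format shown in Change 6; adapt field names if they differ):

import json
import random
from collections import defaultdict

random.seed(42)  # one-time build seed; the output file is then frozen forever

# Bucket eval records by task type
by_task = defaultdict(list)
with open("data/pairs/eval.jsonl") as f:
    for line in f:
        record = json.loads(line)
        user_txt = " ".join(m["content"] for m in record["prompt"] if m["role"] == "user")
        by_task[_classify_task_type(user_txt)].append(record)

# Draw the stratified sample and freeze it to disk
stratified = []
for task, n in EVAL_SAMPLES_PER_TASK.items():
    assert len(by_task[task]) >= n, f"only {len(by_task[task])} {task} samples available"
    stratified.extend(random.sample(by_task[task], n))

with open("data/pairs/eval_v2_stratified.jsonl", "w") as f:
    for record in stratified:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")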

Why this is Change 1: The insights regression (0.84→0.62) made it impossible to distinguish learning from noise. At n≈2, standard error = ±0.22. At n=15, standard error drops to ±0.06. Every other change is uninterpretable without this fixed.

Report format: mean ± 1.96 × std/sqrt(n) for 95% CI on each task.
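A small helper for that report format (the sample values are placeholders):

import statistics

def task_ci(scores: list[float]) -> tuple[float, float]:
    """Mean and 95% CI half-width: 1.96 * std / sqrt(n)."""
    mean = statistics.mean(scores)
    half = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half

mean, half = task_ci([0.81, 0.74, 0.79, 0.85, 0.77])  # placeholder per-sample insights rewards
print(f"insights: {mean:.2f} ± {half:.2f}")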


Change 2: Reward Function Audit Before Training (30-minute Protocol)

What: Before launching any training cell, generate 20 completions (5 per task), manually score them 0-10, and compute Spearman ρ between the human scores and the reward function's scores.

# Cell X — Reward Audit (add between calibration and training)
from scipy.stats import spearmanr

AUDIT_PROMPTS_PER_TASK = 5
audit_human_scores = []   # filled manually by the person running the notebook
audit_auto_scores  = []   # filled by scoring the same completions with the reward function

# 1. Generate completions at temp=0.1 (near-deterministic) and print each one.
# 2. The person assigns a 0-10 score per completion → audit_human_scores.
# 3. Score the identical completions with the reward function → audit_auto_scores.
assert len(audit_human_scores) == len(audit_auto_scores) > 0, "Fill both score lists first"

rho, p_value = spearmanr(audit_human_scores, audit_auto_scores)
print(f"Reward function calibration: ρ={rho:.2f} (p={p_value:.3f})")
assert rho > 0.7, f"Reward function miscalibrated (ρ={rho:.2f} < 0.70). Fix before training."

Gate: ρ > 0.70. If below, find the discrepancy (another parser bug, wrong field weighting, wrong task classifier) before spending GPU hours.
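For the generation step referenced in the comments above, a sketch (assumes the tokenizer/model pair loaded in Cell 4 and a list audit_prompts of chat-format prompts, five per task; both names are placeholders):

completions = []
for prompt in audit_prompts:
    input_ids = tokenizer.apply_chat_template(
        prompt, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.1)
    completions.append(tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True))

# Adapt to commerce_reward_fn's real signature (TRL reward fns usually take batches)
audit_auto_scores = [commerce_reward_fn(c) for c in completions]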

Why: The V1-V4 parser bug would have been caught in 30 minutes with this protocol. This is the single cheapest addition to the methodology.


Change 3: SQL Reward Overhaul

What: Replace heuristic vocabulary matching with a validation-aware reward that distinguishes "mentions SQL keywords" from "produces a correct analytical answer."

import re

def reward_sql_qa_v2(completion: str) -> float:
    answer = strip_think(completion)  # existing project helper: removes the <think> block
    if not answer.strip():
        return 0.0

    score = 0.0

    # Tier 1 (0.30): SQL structure detected
    sql_keywords = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "JOIN", "HAVING", "COUNT", "AVG", "SUM"]
    sql_found = sum(1 for kw in sql_keywords if kw in answer.upper())
    if sql_found >= 3:
        score += 0.30
    elif sql_found >= 1:
        score += 0.15

    # Tier 2 (0.25): Answer has both query and explanation
    has_query  = bool(re.search(r"```sql|SELECT.{5,}FROM", answer, re.IGNORECASE | re.DOTALL))
    has_answer = any(w in answer.lower() for w in ["resultado", "total", "média", "mostra", "portanto"])
    if has_query and has_answer:
        score += 0.25
    elif has_query or has_answer:
        score += 0.12

    # Tier 3 (0.25): Numerical specificity
    numbers = re.findall(r"\d+(?:[.,]\d+)?(?:\s*%)?", answer)
    score += min(0.25, 0.05 * len(numbers))

    # Tier 4 (0.20): Portuguese business domain coherence
    pt_domain = ["pedidos", "clientes", "vendedores", "produtos", "avaliação",
                 "entrega", "reclamação", "satisfação", "categoria", "período"]
    score += min(0.20, 0.04 * sum(1 for w in pt_domain if w in answer.lower()))

    return min(score, 1.0)

Why: The current SQL reward cannot tell the difference between a completion that says "os pedidos de clientes em 2017 totalizaram 15.000" and one that says "SELECT COUNT(*) FROM orders WHERE year=2017 → result: 15,000 pedidos"; it scores both the same. reward_sql_qa_v2 separates them. The model found the easier path (domain vocabulary + numbers) and stopped improving.
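A quick sanity check of that distinction (illustrative strings; assumes strip_think passes text through unchanged when no <think> block is present):

vague = "Os pedidos de clientes em 2017 totalizaram 15.000."
grounded = (
    "```sql\nSELECT COUNT(*) FROM orders WHERE year = 2017\n```\n"
    "O resultado mostra 15.000 pedidos de clientes no período."
)
print(reward_sql_qa_v2(vague))     # ≈ 0.30: domain words + numbers, no SQL structure
print(reward_sql_qa_v2(grounded))  # ≈ 0.77: adds Tier 1 structure and full Tier 2 query+answer credit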

Expected outcome: Two possibilities after the training run:

  • SQL reward improves by ≥20% → the reward was the ceiling; the model has capacity
  • SQL reward still stagnates → confirmed 0.5B capacity limit for analytical reasoning

Both are valid scientific conclusions. The point is to distinguish them.


Change 4: Extended Training to Full Data Coverage (1,500 Steps)

What: MAX_STEPS = 600 → 1500, i.e. 2.5× the V4.1 step budget, with dataset shuffling. At V4.1's consumption rate (train/epoch=0.404 at step 600), 1,500 steps corresponds to roughly one full pass over the training data.

MAX_STEPS        = 1500    # was 600
SAVE_STEPS       = 100     # was 50
EVAL_STEPS       = 50      # was 20; sparser per step, but still ~30 evals over the longer run

Why: V4.1 saw only 40% of the training data, and eval was still improving at step 500 (best eval at step 500/600). There is signal remaining in the data. At 0.404 epochs per 600 steps, 1,500 steps finally covers the full dataset (~1.0 epoch), so every training prompt is seen at least once.

Risk monitoring: Watch eval/mean_reward vs train/reward divergence. If eval plateaus while train keeps rising past step 800, the model is overfitting. Stop training and save the pre-overfitting checkpoint.

Shuffling: Ensure shuffle=True in the DataLoader, or use a different seed per epoch for Dataset.shuffle(). Seeing the same prompts in the same order each epoch eliminates the diversity benefit.
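A minimal illustration of the seed-per-pass variant (assumes the training set is a Hugging Face datasets.Dataset named train_ds; GRPOTrainer's own sampler may make this redundant):

# datasets.Dataset.shuffle returns a new dataset with a fresh permutation;
# varying the seed per pass gives each pass a different prompt order
shuffled_pass_1 = train_ds.shuffle(seed=CURRENT_SEED)
shuffled_pass_2 = train_ds.shuffle(seed=CURRENT_SEED + 1)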

Estimated cost: ~12 hours on L4. Run overnight.


Change 5: GDPO Per-Reward Normalization

What: Normalize each reward component independently before summing, instead of normalizing the batch total of summed components.

# GDPO: normalize each reward component separately before summing
# Instead of: reward = extraction + sql + insights + push → normalize the summed batch
# Do:         normalize each component independently, then sum

import torch

def gdpo_normalize(component_rewards: dict[str, list[float]]) -> list[float]:
    """Per-component normalization before aggregation (GDPO 2601.05242 §3.1)."""
    normalized = {}
    for task, rewards in component_rewards.items():
        rewards_t = torch.tensor(rewards, dtype=torch.float32)
        std = rewards_t.std()
        if std > 1e-8:
            normalized[task] = ((rewards_t - rewards_t.mean()) / std).tolist()
        else:
            normalized[task] = [0.0] * len(rewards)  # zero-variance group
    # Sum normalized components per sample
    n = len(next(iter(normalized.values())))
    return [sum(normalized[t][i] for t in normalized) for i in range(n)]

This requires commerce_reward_fn to return per-component rewards (not just the sum), and the trainer to call gdpo_normalize before computing advantages. Moderate effort — requires a custom trainer subclass.
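Illustrative call (component names and values are made up):

components = {
    "structure": [0.9, 0.2, 0.5],   # varies within the group → normalized to mean 0
    "format":    [1.0, 1.0, 1.0],   # zero variance → contributes 0 to every sample
}
print(gdpo_normalize(components))   # ≈ [1.04, -0.95, -0.09] (torch std is unbiased)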


Change 6: Dynamic Task Weighting (MT-GRPO IWU)

What: Track per-task reward improvement rates every N eval steps and increase sampling probability for stagnating tasks.

import random

class DynamicTaskSampler:
    """
    MT-GRPO §3.2: Improvement-aware Weight Update.
    Upweights tasks with stagnating reward, downweights converging tasks.
    """
    def __init__(self, tasks, initial_weight=0.25, update_interval=50):
        self.tasks    = tasks
        self.weights  = {t: initial_weight for t in tasks}
        self.history  = {t: [] for t in tasks}
        self.interval = update_interval

    def update(self, step: int, per_task_rewards: dict):
        if step % self.interval != 0 or step == 0:
            return
        for task, reward in per_task_rewards.items():
            self.history[task].append(reward)
            if len(self.history[task]) >= 2:
                improvement = self.history[task][-1] - self.history[task][-2]
                if improvement < 0.01:         # stagnating
                    self.weights[task] = min(0.60, self.weights[task] * 1.3)
                elif improvement > 0.05:       # improving fast
                    self.weights[task] = max(0.10, self.weights[task] * 0.85)
        # Normalize to sum to 1
        total = sum(self.weights.values())
        self.weights = {t: w / total for t, w in self.weights.items()}

    def sample_indices(self, dataset, n_samples: int) -> list[int]:
        """Sample indices with probability proportional to task weights."""
        task_indices = {t: [] for t in self.tasks}
        for i, record in enumerate(dataset):
            user_txt = " ".join(m["content"] for m in record["prompt"] if m["role"] == "user")
            task = _classify_task_type(user_txt)
            task_indices[task].append(i)

        sampled = []
        for task, weight in self.weights.items():
            n = max(1, int(n_samples * weight))
            pool = task_indices.get(task, [])
            if pool:
                sampled.extend(random.sample(pool, min(n, len(pool))))
        return sampled

Why: SQL Q&A improved +3.8% vs insights +68% under equal weighting. MT-GRPO's Theorem 1 proves GRPO allocates gradient budget to higher-variance tasks. SQL gets starved. IWU corrects this by progressively increasing SQL's sampling probability as its improvement rate drops below threshold.

Log: wandb.log({"sampler/sql_weight": sampler.weights["sql_qa"], ...}) at every update. You want to see the SQL weight rising from its initial value (0.25 uniform by default, or 0.40 if seeded with the 40/40/10/10 data mix) toward 0.60 if the model is still stagnating.
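Illustrative hookup (per-task rewards would come from the stratified eval at each eval step; the numbers below are made up):

sampler = DynamicTaskSampler(["extraction", "sql_qa", "insights", "push"])
sampler.update(step=50,  per_task_rewards={"extraction": 0.82, "sql_qa": 0.550,
                                           "insights": 0.78, "push": 0.70})
sampler.update(step=100, per_task_rewards={"extraction": 0.84, "sql_qa": 0.552,
                                           "insights": 0.80, "push": 0.73})
print(sampler.weights)  # sql_qa upweighted: its Δ=0.002 is under the 0.01 stagnation threshold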


Change 7: Three Seeds for Reproducibility

What: Run V4.2 with seed ∈ {42, 123, 456}. Report mean ± std across seeds for all headline metrics.

# In Cell 3:
SEEDS = [42, 123, 456]
CURRENT_SEED = 42   # change per run

# In GRPOConfig and Dataset prep:
import random
import torch

seed = CURRENT_SEED
random.seed(seed)
torch.manual_seed(seed)
# transformers.set_seed(seed) would additionally cover the numpy and CUDA RNGs

Why: Three seeds is the minimum for a credible ML result. A single lucky or unlucky random initialization can produce ±0.05 variance in eval reward. With three seeds, you can report 0.645 ± 0.02 instead of just 0.645 — the former is publishable, the latter is an observation.

Cost: 3 × ~12h = ~36h on L4. Do these in parallel if you have access to multiple GPUs, or sequentially overnight.

Report format:

| Task         | Seed 42 | Seed 123 | Seed 456 | Mean ± 95% CI |
|---|---|---|---|---|
| Extraction   | ...     | ...      | ...      | X.XX ± 0.0X   |
| SQL Q&A      | ...     | ...      | ...      | X.XX ± 0.0X   |
| Insights     | ...     | ...      | ...      | X.XX ± 0.0X   |
| Push         | ...     | ...      | ...      | X.XX ± 0.0X   |
| **Mean**     | ...     | ...      | ...      | **X.XX ± 0.0X** |
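The Mean ± 95% CI column can reuse task_ci from Change 1, here applied across the three per-seed values (placeholders below):

per_seed = {"extraction": [0.82, 0.80, 0.83], "sql_qa": [0.58, 0.61, 0.56]}  # hypothetical
for task, scores in per_seed.items():
    mean, half = task_ci(scores)
    row = " | ".join(f"{s:.2f}" for s in scores)
    print(f"| {task} | {row} | {mean:.2f} ± {half:.2f} |")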

Change 8: Best Checkpoint Saving

The V4.1 config had save_total_limit=5 and save_only_model=True but no explicit best-checkpoint logic. GRPOTrainer has no native way to track a callback-computed eval reward the way Trainer's load_best_model_at_end tracks its built-in eval metrics. The correct pattern is to track the best eval reward in EvalRewardCallback and save explicitly when it improves:

# In EvalRewardCallback.on_step_end, after computing mean_reward
# (model and tokenizer are reachable via the callback's **kwargs, or can be
#  stored on the callback at construction time):
improved = mean_reward > self.best_reward
if improved:
    self.best_reward = mean_reward
    self.best_step   = state.global_step
    self.no_improve_count = 0
    # Save best checkpoint explicitly
    best_path = ADAPTER_DIR / "best_checkpoint"
    model.save_pretrained(str(best_path))
    tokenizer.save_pretrained(str(best_path))
    print(f"  ✓ Best checkpoint saved → {best_path} (reward={mean_reward:.4f})")

This guarantees the adapter saved in best_checkpoint/ always corresponds to the peak eval reward, regardless of where training ends or when early stopping fires. Add this to the EvalRewardCallback block; the final Cell 15 should then load from best_checkpoint/ rather than the last training step.
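A sketch of that Cell 15 load, assuming the standard PEFT adapter layout (base_model stands for the freshly loaded base model from Cell 4):

from peft import PeftModel

best_path = ADAPTER_DIR / "best_checkpoint"
model = PeftModel.from_pretrained(base_model, str(best_path))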


V4.2 Config Summary

# Changes from V4.1
MAX_STEPS                  = 1500     # was 600
EVAL_STEPS                 = 50       # was 20
SAVE_STEPS                 = 100      # was 50
SEEDS                      = [42, 123, 456]  # run separately

# Structural additions (not hyperparameters)
eval_set                   = "data/pairs/eval_v2_stratified.jsonl"  # 65 fixed samples
reward_sql_qa              = reward_sql_qa_v2          # new SQL reward (Change 3)
per_component_norm         = gdpo_normalize            # GDPO normalization (Change 5)
dynamic_sampler            = DynamicTaskSampler(...)   # IWU sampler (Change 6)

# Everything else UNCHANGED from V4.1 (these are validated):
NUM_GENERATIONS            = 16
MAX_COMPLETION_LENGTH      = 512
TEMPERATURE                = 1.0
BETA                       = 0.0
SCALE_REWARDS              = False
LEARNING_RATE              = 5e-6
lr_scheduler_type          = "constant_with_warmup"
warmup_ratio               = 0.05
BATCH_SIZE                 = 2
GRAD_ACCUM                 = 1
LORA_R                     = 16
LORA_ALPHA                 = 32

What to Observe During V4.2

The four questions V4.2 must answer

Q1: Does SQL reward improve with the new reward function?
Watch the eval/sql_qa trajectory. If it improves by ≥15% over V4.1 by step 600, the old reward was the ceiling. If it still stagnates at ~0.547, the 0.5B model has a genuine capacity limit on analytical reasoning. Both are valid answers — the important thing is knowing which.

Q2: Is the insights regression noise or forgetting?
With n=15 insights samples, the swing becomes ±0.06 instead of ±0.22. If insights stabilizes at 0.75-0.85 throughout the 1,500-step run, it was noise. If it consistently drops after step 800, it's catastrophic forgetting — and you need MT-GRPO's constrained optimization to address it.

Q3: Does extended training push eval above 0.70?
V4.1 was still improving at step 500 with eval_best=0.645. 1,500 steps of sustained training with constant LR and full data coverage should push this further. Target: ≥0.70 mean across all tasks.

Q4: Are results reproducible across seeds?
If mean ± std across three seeds shows std < 0.03, the result is robust. If std > 0.05, there's significant initialization sensitivity that needs to be understood before claiming these numbers.

WandB metrics to watch in real time

| Metric | Expected | Stop if |
|---|---|---|
| eval/sql_qa | Steeper upward slope vs V4.1 | Still flat at step 200 (reward still wrong) |
| eval/insights | Stable 0.75-0.85, no big swings | Drops consistently below 0.65 after step 800 |
| eval/mean_reward | Continues improving past step 500 | Plateaus at the same 0.645 as V4.1 |
| sampler/sql_weight | Rising from its initial value toward 0.55-0.60 | Stays flat (IWU not triggering) |
| train/loss vs eval/mean_reward | Train loss falling while eval reward rises | Train keeps improving while eval plateaus (overfitting) |

Drawing Conclusions for the Methodology Case

V4.2 closes the 0.5B story. After the runs complete, the project demonstrates:

What was systematically learned across versions

| Version | Failure identified | Fix | Isolated contribution |
|---|---|---|---|
| V1 | 68% extraction task imbalance → task collapse | Rebalanced to 40/40/10/10 | Established balanced multi-task training |
| V2 | DPO near-no-op (9 gradient steps, flat loss) | Replaced DPO with GRPO | Established GRPO as the correct alignment method |
| V3 | Think-model APO anchor + 2,628-token <think> overhead | Switched to Instruct model | Isolated architectural constraint |
| V4 | Cosine LR decayed to zero by step 130 | Constant LR schedule | +370 productive training steps |
| V4 | JSON parser failed on PT-BR decimals | json-repair + PT normalizer | 3.25× measured extraction improvement |
| V4.1 | Stagnant SQL (+3.8%) | New reward function + dynamic weighting | To be measured in V4.2 |
| V4.1 | Single run, no CIs | 3-seed reproducibility | Statistical credibility |

The scientific argument

Each version was not "trying something different" — it was testing a specific hypothesis with a controlled change. The key methodological strength is causal attribution: when eval_best went from 0.476 to 0.645, the V4.1 report decomposed the sources:

  • ~0.13 from parser fix (measured at step 20, before GRPO learning)
  • ~0.13 from GRPO learning (measured from step 20 to peak)

This is the scientific standard: not just "it improved" but "here is why, with evidence."

V4.2 completes this arc by either showing SQL improvement (proving reward was the ceiling) or confirming capacity limit (proving 0.5B has an irreducible floor on analytical tasks). Either outcome is a clean, documentable finding.

The external benchmark gap

One gap remains for true gold-standard credibility: no external benchmark comparison. The current eval suite is project-internal. For the methodology case, consider running the V4.2 final checkpoint against a subset of Portuguese NLP benchmarks from the Tucano2 model card (BLUEX, OAB, ENEM) on the specific knowledge domains your training covered. Even partial coverage demonstrates the model didn't regress on general Portuguese ability while improving on domain tasks. This is the "catastrophic forgetting" test at the benchmark level, complementing the within-run insights stability test.


Notebook Structure for V4.2

Cell  1: Dependencies + env vars
Cell  2: GPU + Unsloth + TRL verification gate
Cell  3: Config constants (CURRENT_SEED variable)
Cell  4: Load model + critical generation_config overrides
Cell  5: Token ID verification gate
Cell  6: KV cache diagnostic gate
Cell  7: Reward functions v2 (including reward_sql_qa_v2)
Cell  8: Reward function audit (30-min protocol, ρ > 0.70 gate)
Cell  9: Build stratified eval set (65 samples, fixed)
Cell 10: Dataset preparation + DynamicTaskSampler init
Cell 11: Smoke test (1 step, VRAM check)
Cell 12: Probe run (10 steps, clip_ratio > 0 on 3/10 gate)
Cell 13: W&B init + full training (1,500 steps)
Cell 14: Post-training validation (stratified 65-sample eval)
Cell 15: Save adapter
Cell 16: Results table generation (for reporting)

Run three times, changing only CURRENT_SEED in Cell 3.


V4.2 is the last 0.5B run. Its purpose is not to find more improvement — it is to know exactly what was found and why, with enough statistical rigor to say so in writing.