# V4.2 Handoff — Closing the 0.5B to Gold Standard
**Date:** 2026-04-27
**Context:** V4.1 achieved eval_best=0.645 (+35.5% over V4). Parser fix + constant LR
were the two decisive changes. SQL Q&A remains stagnant (+3.8%). Insights swing
(0.84→0.62) is eval noise, not regression. Goal: close remaining gaps before declaring
the 0.5B scientifically complete and making the methodology case.
---
## What V4.1 Left Open (Ordered by Urgency)
| Gap | Evidence | Blocks |
|-----|----------|--------|
| Eval suite too small (n≈2 for insights/push) | Insights swings ±0.22 between consecutive evals | Every subsequent claim about per-task performance |
| SQL Q&A stagnant at +3.8% | Reward function doesn't validate SQL quality | Knowing whether this is reward ceiling or capacity limit |
| Only 40% of training data seen | train/epoch=0.404 at step 600 | Knowing true performance ceiling |
| No reward function audit protocol | Parser bug persisted 3 versions | Next parser-class bug catching you again |
| Single run, no confidence intervals | No error bars on any reported number | Credibility of the methodology case |
---
## V4.2 Changes — Exactly What to Implement
### Change 1: Expand Eval Suite to 50+ Samples (Do Before Training)
**What:** Build a static eval set with minimum 15 samples per task type, stratified and
held fixed across all future runs.
```python
EVAL_SAMPLES_PER_TASK = {
    "extraction": 20,
    "sql_qa": 15,
    "insights": 15,
    "push": 15,
}
# Total: 65 eval samples (was 15 mixed)
```
**How:** Sample from `data/pairs/eval.jsonl`, verify task distribution, save as
`data/pairs/eval_v2_stratified.jsonl`. Never resample — the same 65 prompts must be
used across V4.2 seeds 42, 123, 456.
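A minimal sketch of the stratification step, assuming JSONL records with a chat-style `prompt` field and reusing the project's `_classify_task_type` helper (both the file layout and the helper's behaviour are assumptions; adapt to the actual schema):
```python
import json
import random
from pathlib import Path

random.seed(0)  # arbitrary but fixed: the resulting eval set is frozen once and reused

records = [json.loads(line) for line in
           Path("data/pairs/eval.jsonl").read_text().splitlines() if line.strip()]

# Bucket records by task type (assumes the project's _classify_task_type helper
# and the chat-style "prompt" field used elsewhere in the notebook).
by_task: dict[str, list] = {}
for rec in records:
    user_txt = " ".join(m["content"] for m in rec["prompt"] if m["role"] == "user")
    by_task.setdefault(_classify_task_type(user_txt), []).append(rec)

# Draw the fixed per-task quota and write the frozen eval set.
stratified = []
for task, n in EVAL_SAMPLES_PER_TASK.items():
    pool = by_task.get(task, [])
    stratified.extend(random.sample(pool, min(n, len(pool))))

with open("data/pairs/eval_v2_stratified.jsonl", "w", encoding="utf-8") as f:
    for rec in stratified:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```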
**Why this is Change 1:** The insights regression (0.84→0.62) made it impossible to
distinguish learning from noise. At n≈2, standard error = ±0.22. At n=15, standard
error drops to ±0.06. Every other change is uninterpretable without this fixed.
**Report format:** `mean ± 1.96 × std/sqrt(n)` for 95% CI on each task.
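A short sketch of that report computation (the `eval_rewards_by_task` dict is hypothetical; use whatever per-sample structure the eval cell actually produces):
```python
import math
import statistics

def ci95(scores: list[float]) -> tuple[float, float]:
    """Mean and half-width of the 95% CI for one task's per-sample eval rewards."""
    return statistics.mean(scores), 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))

# eval_rewards_by_task: hypothetical dict mapping task -> list of per-sample rewards
for task, scores in eval_rewards_by_task.items():
    mean, half = ci95(scores)
    print(f"{task}: {mean:.3f} ± {half:.3f} (n={len(scores)})")
```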
---
### Change 2: Reward Function Audit Before Training (30-minute Protocol)
**What:** Before launching any training cell, generate 20 completions (5 per task),
manually score them 0-10, compute Spearman ρ between human scores and reward function.
```python
# Cell X — Reward Audit (add between calibration and training)
from scipy.stats import spearmanr
AUDIT_PROMPTS_PER_TASK = 5
audit_human_scores = [] # filled manually by the person running the notebook
audit_auto_scores = []
# Generate completions at temperature=0.1 (near-greedy, effectively deterministic), print each one
# Person assigns 0-10 score, enters into audit_human_scores list
# Then:
rho, p_value = spearmanr(audit_human_scores, audit_auto_scores)
print(f"Reward function calibration: ρ={rho:.2f} (p={p_value:.3f})")
assert rho > 0.7, f"Reward function miscalibrated (ρ={rho:.2f} < 0.70). Fix before training."
```
**Gate:** ρ > 0.70. If below, find the discrepancy (another parser bug, wrong field
weighting, wrong task classifier) before spending GPU hours.
**Why:** The V1-V4 parser bug would have been caught in 30 minutes with this protocol.
This is the single cheapest addition to the methodology.
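A sketch of the generation half of the audit. The `audit_prompts` dict and the per-task reward-function names are assumptions (substitute the actual functions from Cell 7, e.g. `reward_sql_qa_v2` from Change 3 below); only `spearmanr` and the ρ gate above come from the source:
```python
# Hypothetical mapping: task name -> reward function defined in Cell 7.
REWARD_FNS = {
    "extraction": reward_extraction,
    "sql_qa": reward_sql_qa_v2,
    "insights": reward_insights,
    "push": reward_push,
}

# audit_prompts: hypothetical dict mapping task -> list of 5 prompt strings
# (apply the chat template first if the prompts are stored as message lists).
for task, prompts in audit_prompts.items():
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
        completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        print(f"--- {task} ---\n{completion}\n")  # read it, then append a 0-10 score to audit_human_scores
        # Spearman ρ is rank-based, so the 0-1 reward scale needs no rescaling.
        audit_auto_scores.append(REWARD_FNS[task](completion))
```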
---
### Change 3: SQL Reward Overhaul
**What:** Replace heuristic vocabulary matching with a validation-aware reward that
distinguishes "mentions SQL keywords" from "produces a correct analytical answer."
```python
import re

def reward_sql_qa_v2(completion: str) -> float:
    answer = strip_think(completion)  # project helper: drops any <think> block
    if not answer.strip():
        return 0.0
    score = 0.0
    # Tier 1 (0.30): SQL structure detected
    sql_keywords = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "JOIN",
                    "HAVING", "COUNT", "AVG", "SUM"]
    sql_found = sum(1 for kw in sql_keywords if kw in answer.upper())
    if sql_found >= 3:
        score += 0.30
    elif sql_found >= 1:
        score += 0.15
    # Tier 2 (0.25): Answer has both query and explanation
    has_query = bool(re.search(r"```sql|SELECT.{5,}FROM", answer, re.IGNORECASE | re.DOTALL))
    has_answer = any(w in answer.lower() for w in ["resultado", "total", "média", "mostra", "portanto"])
    if has_query and has_answer:
        score += 0.25
    elif has_query or has_answer:
        score += 0.12
    # Tier 3 (0.25): Numerical specificity
    numbers = re.findall(r"\d+(?:[.,]\d+)?(?:\s*%)?", answer)
    score += min(0.25, 0.05 * len(numbers))
    # Tier 4 (0.20): Portuguese business domain coherence
    pt_domain = ["pedidos", "clientes", "vendedores", "produtos", "avaliação",
                 "entrega", "reclamação", "satisfação", "categoria", "período"]
    score += min(0.20, 0.04 * sum(1 for w in pt_domain if w in answer.lower()))
    return min(score, 1.0)
```
**Why:** The current SQL reward scores a completion that says "os pedidos de clientes em
2017 totalizaram 15.000" the same as one that says "SELECT COUNT(*) FROM orders WHERE
year=2017 → result: 15,000 pedidos". The old heuristic cannot tell the difference; the
Tier 1 structure check in `reward_sql_qa_v2` can. The model found the easier path
(domain vocabulary + numbers) and stopped improving.
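A quick sanity check of that distinction (a sketch; it assumes `strip_think` passes text without a `<think>` block through unchanged):
```python
vocab_only = "os pedidos de clientes em 2017 totalizaram 15.000"
with_query = "SELECT COUNT(*) FROM orders WHERE year=2017 → result: 15,000 pedidos"

# Both earn partial Tier 2/3/4 credit, but only the query-backed answer
# earns the Tier 1 SQL-structure credit, so it must score strictly higher.
assert reward_sql_qa_v2(with_query) > reward_sql_qa_v2(vocab_only)
```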
**Expected outcome:** Two possibilities after the training run:
- SQL reward improves +20%+ → reward was the ceiling, model has capacity
- SQL reward still stagnates → confirmed 0.5B capacity limit for analytical reasoning
Both are valid scientific conclusions. The point is to distinguish them.
---
### Change 4: Extended Training (1,500 Steps)
**What:** `MAX_STEPS = 600 → 1500`, 2.5× the V4.1 step budget, with dataset shuffling.
Extrapolating from `train/epoch=0.404` at step 600, this covers roughly one full pass
over the training data.
```python
MAX_STEPS = 1500 # was 600
SAVE_STEPS = 100 # was 50
EVAL_STEPS = 50 # was 20; keeps roughly the same ~30 evals spread across the longer run
```
**Why:** V4.1 saw only ~40% of the training data, and eval was still improving at step 500
(best eval at step 500/600), so there is signal left in the unseen portion. With batch
settings unchanged, 1,500 steps extrapolates to roughly one full epoch
(1500/600 × 0.404 ≈ 1.0), which is why the shuffling note below matters if the run
drifts past an epoch boundary.
**Risk monitoring:** Watch `eval/mean_reward` vs `train/reward` divergence. If eval
plateaus while train keeps rising past step 800, the model is overfitting. Stop training
and save the pre-overfitting checkpoint.
**Shuffling:** Ensure `shuffle=True` in the DataLoader, or use a different seed per
epoch for `Dataset.shuffle()`. Seeing the same prompts in the same order each epoch
eliminates the diversity benefit.
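A minimal sketch of the dataset-side shuffle, assuming the prompts live in a Hugging Face `datasets.Dataset`; the train-file path and the `epoch` counter are placeholders:
```python
from datasets import load_dataset

# Placeholder path: substitute the project's actual training file.
train_ds = load_dataset("json", data_files="data/pairs/train.jsonl", split="train")

# Shuffle once with the run seed; if the run crosses an epoch boundary, reshuffle
# with an epoch-dependent seed (or rely on the trainer's dataloader shuffling).
train_ds = train_ds.shuffle(seed=CURRENT_SEED)
# train_ds = train_ds.shuffle(seed=CURRENT_SEED + epoch)  # per-pass reshuffle; epoch = hypothetical counter
```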
**Estimated cost:** ~12 hours on L4. Run overnight.
---
### Change 5: GDPO per-reward normalization
```python
# GDPO: normalize each reward component separately before summing
# Instead of: reward = sum(extraction + sql + insights + push) → normalize batch
# Do: normalize each component independently, then sum
import torch

def gdpo_normalize(component_rewards: dict[str, list[float]]) -> list[float]:
    """Per-component normalization before aggregation (GDPO 2601.05242 §3.1)."""
    normalized = {}
    for task, rewards in component_rewards.items():
        rewards_t = torch.tensor(rewards, dtype=torch.float32)
        std = rewards_t.std()
        if std > 1e-8:
            normalized[task] = ((rewards_t - rewards_t.mean()) / std).tolist()
        else:
            normalized[task] = [0.0] * len(rewards)  # zero-variance group contributes no signal
    # Sum normalized components per sample
    n = len(next(iter(normalized.values())))
    return [sum(normalized[t][i] for t in normalized) for i in range(n)]
```
This requires `commerce_reward_fn` to return per-component rewards (not just the sum), and
the trainer to call `gdpo_normalize` before computing advantages. Moderate effort — requires
a custom trainer subclass.
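For intuition, a toy call with made-up numbers showing what the per-component pass does: each component is z-scored on its own scale before the per-sample sums are taken, so a high-variance component no longer dominates the summed reward.
```python
# Hypothetical batch of 4 completions with two reward components each.
batch_components = {
    "extraction": [0.9, 0.1, 0.8, 0.2],       # high variance
    "sql_qa":     [0.50, 0.52, 0.49, 0.51],   # nearly flat
}
print(gdpo_normalize(batch_components))  # 4 per-sample sums of z-scored components
```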
---
### Change 6: Dynamic Task Weighting (MT-GRPO IWU)
**What:** Track per-task reward improvement rates every N eval steps and increase
sampling probability for stagnating tasks.
```python
import random

class DynamicTaskSampler:
    """
    MT-GRPO §3.2: Improvement-aware Weight Update (IWU).
    Upweights tasks with stagnating reward, downweights converging tasks.
    """

    def __init__(self, tasks, initial_weight=0.25, update_interval=50):
        self.tasks = tasks
        self.weights = {t: initial_weight for t in tasks}
        self.history = {t: [] for t in tasks}
        self.interval = update_interval

    def update(self, step: int, per_task_rewards: dict):
        if step % self.interval != 0 or step == 0:
            return
        for task, reward in per_task_rewards.items():
            self.history[task].append(reward)
            if len(self.history[task]) >= 2:
                improvement = self.history[task][-1] - self.history[task][-2]
                if improvement < 0.01:    # stagnating
                    self.weights[task] = min(0.60, self.weights[task] * 1.3)
                elif improvement > 0.05:  # improving fast
                    self.weights[task] = max(0.10, self.weights[task] * 0.85)
        # Normalize to sum to 1
        total = sum(self.weights.values())
        self.weights = {t: w / total for t, w in self.weights.items()}

    def sample_indices(self, dataset, n_samples: int) -> list[int]:
        """Sample indices with probability proportional to task weights."""
        task_indices = {t: [] for t in self.tasks}
        for i, record in enumerate(dataset):
            user_txt = " ".join(m["content"] for m in record["prompt"] if m["role"] == "user")
            task = _classify_task_type(user_txt)  # project helper from the dataset cells
            task_indices[task].append(i)
        sampled = []
        for task, weight in self.weights.items():
            n = max(1, int(n_samples * weight))
            pool = task_indices.get(task, [])
            if pool:
                sampled.extend(random.sample(pool, min(n, len(pool))))
        return sampled
```
**Why:** SQL Q&A improved +3.8% vs insights +68% under equal weighting. MT-GRPO's
Theorem 1 proves GRPO allocates gradient budget to higher-variance tasks. SQL gets
starved. IWU corrects this by progressively increasing SQL's sampling probability as
its improvement rate drops below threshold.
**Log:** `wandb.log({"sampler/sql_weight": sampler.weights["sql_qa"], ...})` at every
update. You want to see SQL weight increasing from 0.40 (initial) toward 0.60+ if the
model is still stagnating.
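A hedged wiring sketch: `current_step`, `per_task_eval_rewards`, and `train_dataset` are hypothetical names for values the eval callback and dataset cell already hold.
```python
sampler = DynamicTaskSampler(tasks=["extraction", "sql_qa", "insights", "push"])

# At each eval boundary (every 50 steps), feed back the per-task eval rewards,
# e.g. {"extraction": 0.78, "sql_qa": 0.55, "insights": 0.80, "push": 0.74}.
sampler.update(step=current_step, per_task_rewards=per_task_eval_rewards)
wandb.log({f"sampler/{t}_weight": w for t, w in sampler.weights.items()}, step=current_step)

# Before the next training window, draw a task-weighted subset of prompt indices.
next_indices = sampler.sample_indices(train_dataset, n_samples=100)
```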
---
### Change 7: Three Seeds for Reproducibility
**What:** Run V4.2 with `seed ∈ {42, 123, 456}`. Report `mean ± std` across seeds
for all headline metrics.
```python
# In Cell 3:
SEEDS = [42, 123, 456]
CURRENT_SEED = 42 # change per run
# In GRPOConfig and Dataset prep:
seed = CURRENT_SEED
random.seed(seed)
torch.manual_seed(seed)
```
**Why:** Three seeds is the minimum for a credible ML result. A single lucky or unlucky
random initialization can produce ±0.05 variance in eval reward. With three seeds, you
can report `0.645 ± 0.02` instead of just `0.645` — the former is publishable, the
latter is an observation.
**Cost:** 3 × ~12h = ~36h on L4. Do these in parallel if you have access to multiple
GPUs, or sequentially overnight.
**Report format:**
```
| Task | Seed 42 | Seed 123 | Seed 456 | Mean ± 95% CI |
|---|---|---|---|---|
| Extraction | ... | ... | ... | X.XX ± 0.0X |
| SQL Q&A | ... | ... | ... | X.XX ± 0.0X |
| Insights | ... | ... | ... | X.XX ± 0.0X |
| Push | ... | ... | ... | X.XX ± 0.0X |
| **Mean** | ... | ... | ... | **X.XX ± 0.0X** |
```
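A sketch for Cell 16 that fills the table above; the `results_by_seed` values are placeholders for the per-seed, per-task eval means.
```python
import math
import statistics

# Placeholder numbers for illustration only; substitute the three runs' results.
results_by_seed = {
    42:  {"extraction": 0.00, "sql_qa": 0.00, "insights": 0.00, "push": 0.00},
    123: {"extraction": 0.00, "sql_qa": 0.00, "insights": 0.00, "push": 0.00},
    456: {"extraction": 0.00, "sql_qa": 0.00, "insights": 0.00, "push": 0.00},
}

for task in ["extraction", "sql_qa", "insights", "push"]:
    vals = [results_by_seed[s][task] for s in (42, 123, 456)]
    mean = statistics.mean(vals)
    ci = 1.96 * statistics.stdev(vals) / math.sqrt(len(vals))
    print(f"| {task} | " + " | ".join(f"{v:.2f}" for v in vals) + f" | {mean:.2f} ± {ci:.2f} |")
```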
---
### Change 8: Best checkpoint saving
The V4.1 config had `save_total_limit=5` and `save_only_model=True` but no explicit best-checkpoint
logic. `GRPOTrainer` doesn't have a native `load_best_model_at_end` equivalent the way `Trainer` does.
The correct pattern is to track best eval reward in `EvalRewardCallback` and save explicitly when
it improves:
```python
# In EvalRewardCallback.on_step_end, after computing mean_reward:
if improved:  # improved = mean_reward > self.best_reward, computed earlier in the callback
    self.best_reward = mean_reward
    self.best_step = state.global_step
    self.no_improve_count = 0
    # Save best checkpoint explicitly
    best_path = ADAPTER_DIR / "best_checkpoint"
    model.save_pretrained(str(best_path))
    tokenizer.save_pretrained(str(best_path))
    print(f"  ✓ Best checkpoint saved → {best_path} (reward={mean_reward:.4f})")
```
This guarantees the adapter saved in `best_checkpoint/` is always the peak eval reward, regardless of where
training ends or when early stopping fires. Add this to the `EvalRewardCallback` block and the final Cell 15
should load from `best_checkpoint/` rather than the last training step.
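A minimal loading sketch for Cell 15 via plain PEFT (the notebook may instead use Unsloth's loader; `BASE_MODEL_ID` is a placeholder for the model constant from Cell 3):
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

best_path = ADAPTER_DIR / "best_checkpoint"
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID)   # placeholder constant
model = PeftModel.from_pretrained(base, str(best_path))      # attach the best LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(str(best_path))
```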
---
## V4.2 Config Summary
```python
# Changes from V4.1
MAX_STEPS = 1500 # was 600
EVAL_STEPS = 50 # was 20
SAVE_STEPS = 100 # was 50
SEEDS = [42, 123, 456] # run separately
# Structural additions (not hyperparameters)
eval_set = "data/pairs/eval_v2_stratified.jsonl" # 65 fixed samples
reward_sql_qa = reward_sql_qa_v2 # new SQL reward (Change 3)
dynamic_sampler = DynamicTaskSampler(...) # Change 6
# Everything else UNCHANGED from V4.1 (these are validated):
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
LEARNING_RATE = 5e-6
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
```
---
## What to Observe During V4.2
### The four questions V4.2 must answer
**Q1: Does SQL reward improve with the new reward function?**
Watch `eval/sql_qa` trajectory. If it improves +15%+ over V4.1 by step 600, the old
reward was the ceiling. If it still stagnates at ~0.547, the 0.5B model has a genuine
capacity limit on analytical reasoning. Both are valid answers — the important thing is
knowing which.
**Q2: Is the insights regression noise or forgetting?**
With n=15 insights samples, the swing becomes ±0.06 instead of ±0.22. If insights
stabilizes at 0.75-0.85 throughout the extended run, it was noise. If it consistently
drops after step 800, it's catastrophic forgetting — and you need MT-GRPO's constrained
optimization to address it.
**Q3: Does extended training push eval above 0.70?**
V4.1 was still improving at step 500 with eval_best=0.645. 1,500 steps of sustained
training with constant LR and full data coverage should push this further. Target: ≥0.70
mean across all tasks.
**Q4: Are results reproducible across seeds?**
If `mean ± std` across three seeds shows std < 0.03, the result is robust. If std > 0.05,
there's significant initialization sensitivity that needs to be understood before claiming
these numbers.
### WandB metrics to watch in real time
| Metric | Expected | Stop if |
|---|---|---|
| `eval/sql_qa` | Steeper upward slope vs V4.1 | Still flat at step 200 (reward still wrong) |
| `eval/insights` | Stable 0.75-0.85, no big swings | Drops consistently below 0.65 after step 800 |
| `eval/mean_reward` | Continues improving past step 500 | Plateaus at same 0.645 as V4.1 |
| `sampler/sql_weight` | Rising from 0.40 toward 0.55-0.60 | Stays flat (IWU not triggering) |
| `train/reward` vs `eval/mean_reward` | Both rising together | Train keeps rising while eval plateaus past step 800 (overfitting) |
---
## Drawing Conclusions for the Methodology Case
V4.2 closes the 0.5B story. After the runs complete, the project demonstrates:
### What was systematically learned across versions
| Version | Failure identified | Fix | Isolated contribution |
|---|---|---|---|
| V1 | 68% extraction task imbalance → task collapse | Rebalanced to 40/40/10/10 | Established balanced multi-task training |
| V2 | DPO near-no-op (9 gradient steps, flat loss) | Replaced DPO with GRPO | Established GRPO as the correct alignment method |
| V3 | Think model APO anchor + 2,628-token `<think>` overhead | Switched to Instruct model | Isolated architectural constraint |
| V4 | Cosine LR decay to zero by step 130 | Constant LR schedule | +370 productive training steps |
| V4 | JSON parser failed on PT-BR decimals | json-repair + PT normalizer | 3.25× measured extraction improvement |
| V4.1 | Stagnant SQL (+3.8%) | New reward function + dynamic weighting | To be measured in V4.2 |
| V4.1 | Single run, no CIs | 3-seed reproducibility | Statistical credibility |
### The scientific argument
Each version was not "trying something different" — it was **testing a specific hypothesis**
with a controlled change. The key methodological strength is causal attribution: when
eval_best went from 0.476 to 0.645, the V4.1 report decomposed the sources:
- ~0.13 from parser fix (measured at step 20, before GRPO learning)
- ~0.13 from GRPO learning (measured from step 20 to peak)
This is the scientific standard: not just "it improved" but "here is why, with evidence."
V4.2 completes this arc by either showing SQL improvement (proving reward was the ceiling)
or confirming capacity limit (proving 0.5B has an irreducible floor on analytical tasks).
Either outcome is a clean, documentable finding.
### The external benchmark gap
One gap remains for true gold-standard credibility: no external benchmark comparison.
The current eval suite is project-internal. For the methodology case, consider running
the V4.2 final checkpoint against a subset of Portuguese NLP benchmarks from the Tucano2
model card (BLUEX, OAB, ENEM) on the specific knowledge domains your training covered.
Even partial coverage demonstrates the model didn't regress on general Portuguese ability
while improving on domain tasks. This is the "catastrophic forgetting" test at the
benchmark level, complementing the within-run insights stability test.
---
## Notebook Structure for V4.2
```
Cell 1: Dependencies + env vars
Cell 2: GPU + Unsloth + TRL verification gate
Cell 3: Config constants (CURRENT_SEED variable)
Cell 4: Load model + critical generation_config overrides
Cell 5: Token ID verification gate
Cell 6: KV cache diagnostic gate
Cell 7: Reward functions v2 (including reward_sql_qa_v2)
Cell 8: Reward function audit (30-min protocol, ρ > 0.70 gate)
Cell 9: Build stratified eval set (65 samples, fixed)
Cell 10: Dataset preparation + DynamicTaskSampler init
Cell 11: Smoke test (1 step, VRAM check)
Cell 12: Probe run (10 steps, clip_ratio > 0 on 3/10 gate)
Cell 13: W&B init + full training (1,500 steps)
Cell 14: Post-training validation (stratified 65-sample eval)
Cell 15: Save adapter
Cell 16: Results table generation (for reporting)
```
**Run three times, changing only `CURRENT_SEED` in Cell 3.**
---
*V4.2 is the last 0.5B run. Its purpose is not to find more improvement —
it is to know exactly what was found and why, with enough statistical rigor
to say so in writing.*