# V4.2 Handoff — Closing the 0.5B to Gold Standard
**Date:** 2026-04-27
**Context:** V4.1 achieved eval_best=0.645 (+35.5% over V4). Parser fix + constant LR
were the two decisive changes. SQL Q&A remains stagnant (+3.8%). Insights swing
(0.84→0.62) is eval noise, not regression. Goal: close remaining gaps before declaring
the 0.5B scientifically complete and making the methodology case.
---
## What V4.1 Left Open (Ordered by Urgency)
| Gap | Evidence | Blocks |
|-----|----------|--------|
| Eval suite too small (n≈2 for insights/push) | Insights swings ±0.22 between consecutive evals | Every subsequent claim about per-task performance |
| SQL Q&A stagnant at +3.8% | Reward function doesn't validate SQL quality | Knowing whether this is reward ceiling or capacity limit |
| Only 40% of training data seen | train/epoch=0.404 at step 600 | Knowing true performance ceiling |
| No reward function audit protocol | Parser bug persisted 3 versions | Next parser-class bug catching you again |
| Single run, no confidence intervals | No error bars on any reported number | Credibility of the methodology case |
---
## V4.2 Changes — Exactly What to Implement
### Change 1: Expand Eval Suite to 50+ Samples (Do Before Training)
**What:** Build a static eval set with minimum 15 samples per task type, stratified and
held fixed across all future runs.
```python
EVAL_SAMPLES_PER_TASK = {
    "extraction": 20,
    "sql_qa": 15,
    "insights": 15,
    "push": 15,
}
# Total: 65 eval samples (was 15 mixed)
```
**How:** Sample from `data/pairs/eval.jsonl`, verify task distribution, save as
`data/pairs/eval_v2_stratified.jsonl`. Never resample — the same 65 prompts must be
used across V4.2 seeds 42, 123, 456.
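A minimal sketch of the stratification step, assuming JSONL records with a chat-style `prompt` field and reusing the project's `_classify_task_type` helper (both the file layout and the helper's behaviour are assumptions; adapt to the actual schema):
```python
import json
import random
from pathlib import Path

random.seed(0)  # arbitrary but fixed: the resulting eval set is frozen once and reused

records = [json.loads(line) for line in
           Path("data/pairs/eval.jsonl").read_text().splitlines() if line.strip()]

# Bucket records by task type (assumes the project's _classify_task_type helper
# and the chat-style "prompt" field used elsewhere in the notebook).
by_task: dict[str, list] = {}
for rec in records:
    user_txt = " ".join(m["content"] for m in rec["prompt"] if m["role"] == "user")
    by_task.setdefault(_classify_task_type(user_txt), []).append(rec)

# Draw the fixed per-task quota and write the frozen eval set.
stratified = []
for task, n in EVAL_SAMPLES_PER_TASK.items():
    pool = by_task.get(task, [])
    stratified.extend(random.sample(pool, min(n, len(pool))))

with open("data/pairs/eval_v2_stratified.jsonl", "w", encoding="utf-8") as f:
    for rec in stratified:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```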
**Why this is Change 1:** The insights regression (0.84→0.62) made it impossible to
distinguish learning from noise. At n≈2, standard error = ±0.22. At n=15, standard
error drops to ±0.06. Every other change is uninterpretable without this fixed.
**Report format:** `mean ± 1.96 × std/sqrt(n)` for 95% CI on each task.
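A short sketch of that report computation (the `eval_rewards_by_task` dict is hypothetical; use whatever per-sample structure the eval cell actually produces):
```python
import math
import statistics

def ci95(scores: list[float]) -> tuple[float, float]:
    """Mean and half-width of the 95% CI for one task's per-sample eval rewards."""
    return statistics.mean(scores), 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))

# eval_rewards_by_task: hypothetical dict mapping task -> list of per-sample rewards
for task, scores in eval_rewards_by_task.items():
    mean, half = ci95(scores)
    print(f"{task}: {mean:.3f} ± {half:.3f} (n={len(scores)})")
```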
---
### Change 2: Reward Function Audit Before Training (30-minute Protocol)
**What:** Before launching any training cell, generate 20 completions (5 per task),
manually score them 0-10, compute Spearman ρ between human scores and reward function.
```python
# Cell X — Reward Audit (add between calibration and training)
from scipy.stats import spearmanr
AUDIT_PROMPTS_PER_TASK = 5
audit_human_scores = [] # filled manually by the person running the notebook
audit_auto_scores = []
# Generate completions at temperature=0.1 (near-greedy, effectively deterministic), print each one
# Person assigns 0-10 score, enters into audit_human_scores list
# Then:
rho, p_value = spearmanr(audit_human_scores, audit_auto_scores)
print(f"Reward function calibration: ρ={rho:.2f} (p={p_value:.3f})")
assert rho > 0.7, f"Reward function miscalibrated (ρ={rho:.2f} < 0.70). Fix before training."
```
**Gate:** ρ > 0.70. If below, find the discrepancy (another parser bug, wrong field
weighting, wrong task classifier) before spending GPU hours.
**Why:** The V1-V4 parser bug would have been caught in 30 minutes with this protocol.
This is the single cheapest addition to the methodology.
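A sketch of the generation half of the audit. The `audit_prompts` dict and the per-task reward-function names are assumptions (substitute the actual functions from Cell 7, e.g. `reward_sql_qa_v2` from Change 3 below); only `spearmanr` and the ρ gate above come from the source:
```python
# Hypothetical mapping: task name -> reward function defined in Cell 7.
REWARD_FNS = {
    "extraction": reward_extraction,
    "sql_qa": reward_sql_qa_v2,
    "insights": reward_insights,
    "push": reward_push,
}

# audit_prompts: hypothetical dict mapping task -> list of 5 prompt strings
# (apply the chat template first if the prompts are stored as message lists).
for task, prompts in audit_prompts.items():
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
        completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        print(f"--- {task} ---\n{completion}\n")  # read it, then append a 0-10 score to audit_human_scores
        # Spearman ρ is rank-based, so the 0-1 reward scale needs no rescaling.
        audit_auto_scores.append(REWARD_FNS[task](completion))
```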
---
### Change 3: SQL Reward Overhaul
**What:** Replace heuristic vocabulary matching with a validation-aware reward that
distinguishes "mentions SQL keywords" from "produces a correct analytical answer."
```python
import re

def reward_sql_qa_v2(completion: str) -> float:
    answer = strip_think(completion)  # project helper: drops any <think> block
    if not answer.strip():
        return 0.0
    score = 0.0
    # Tier 1 (0.30): SQL structure detected
    sql_keywords = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "JOIN",
                    "HAVING", "COUNT", "AVG", "SUM"]
    sql_found = sum(1 for kw in sql_keywords if kw in answer.upper())
    if sql_found >= 3:
        score += 0.30
    elif sql_found >= 1:
        score += 0.15
    # Tier 2 (0.25): Answer has both query and explanation
    has_query = bool(re.search(r"```sql|SELECT.{5,}FROM", answer, re.IGNORECASE | re.DOTALL))
    has_answer = any(w in answer.lower() for w in ["resultado", "total", "média", "mostra", "portanto"])
    if has_query and has_answer:
        score += 0.25
    elif has_query or has_answer:
        score += 0.12
    # Tier 3 (0.25): Numerical specificity
    numbers = re.findall(r"\d+(?:[.,]\d+)?(?:\s*%)?", answer)
    score += min(0.25, 0.05 * len(numbers))
    # Tier 4 (0.20): Portuguese business domain coherence
    pt_domain = ["pedidos", "clientes", "vendedores", "produtos", "avaliação",
                 "entrega", "reclamação", "satisfação", "categoria", "período"]
    score += min(0.20, 0.04 * sum(1 for w in pt_domain if w in answer.lower()))
    return min(score, 1.0)
```
**Why:** The current SQL reward scores a completion that says "os pedidos de clientes em
2017 totalizaram 15.000" the same as one that says "SELECT COUNT(*) FROM orders WHERE
year=2017 → result: 15,000 pedidos". The old heuristic cannot tell the difference; the
Tier 1 structure check in `reward_sql_qa_v2` can. The model found the easier path
(domain vocabulary + numbers) and stopped improving.
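A quick sanity check of that distinction (a sketch; it assumes `strip_think` passes text without a `<think>` block through unchanged):
```python
vocab_only = "os pedidos de clientes em 2017 totalizaram 15.000"
with_query = "SELECT COUNT(*) FROM orders WHERE year=2017 → result: 15,000 pedidos"

# Both earn partial Tier 2/3/4 credit, but only the query-backed answer
# earns the Tier 1 SQL-structure credit, so it must score strictly higher.
assert reward_sql_qa_v2(with_query) > reward_sql_qa_v2(vocab_only)
```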
**Expected outcome:** Two possibilities after the training run:
- SQL reward improves +20%+ → reward was the ceiling, model has capacity
- SQL reward still stagnates → confirmed 0.5B capacity limit for analytical reasoning
Both are valid scientific conclusions. The point is to distinguish them.
---
### Change 4: Extended Training (1,500 Steps)
**What:** `MAX_STEPS = 600 → 1500`, 2.5× the V4.1 step budget, with dataset shuffling.
Extrapolating from `train/epoch=0.404` at step 600, this covers roughly one full pass
over the training data.
```python
MAX_STEPS = 1500 # was 600
SAVE_STEPS = 100 # was 50
EVAL_STEPS = 50 # was 20; keeps roughly the same ~30 evals spread across the longer run
```
**Why:** V4.1 saw only ~40% of the training data, and eval was still improving at step 500
(best eval at step 500/600), so there is signal left in the unseen portion. With batch
settings unchanged, 1,500 steps extrapolates to roughly one full epoch
(1500/600 × 0.404 ≈ 1.0), which is why the shuffling note below matters if the run
drifts past an epoch boundary.
**Risk monitoring:** Watch `eval/mean_reward` vs `train/reward` divergence. If eval
plateaus while train keeps rising past step 800, the model is overfitting. Stop training
and save the pre-overfitting checkpoint.
**Shuffling:** Ensure `shuffle=True` in the DataLoader, or use a different seed per
epoch for `Dataset.shuffle()`. Seeing the same prompts in the same order each epoch
eliminates the diversity benefit.
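A minimal sketch of the dataset-side shuffle, assuming the prompts live in a Hugging Face `datasets.Dataset`; the train-file path and the `epoch` counter are placeholders:
```python
from datasets import load_dataset

# Placeholder path: substitute the project's actual training file.
train_ds = load_dataset("json", data_files="data/pairs/train.jsonl", split="train")

# Shuffle once with the run seed; if the run crosses an epoch boundary, reshuffle
# with an epoch-dependent seed (or rely on the trainer's dataloader shuffling).
train_ds = train_ds.shuffle(seed=CURRENT_SEED)
# train_ds = train_ds.shuffle(seed=CURRENT_SEED + epoch)  # per-pass reshuffle; epoch = hypothetical counter
```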
**Estimated cost:** ~12 hours on L4. Run overnight.
---
### Change 5: GDPO per-reward normalization
```python
# GDPO: normalize each reward component separately before summing
# Instead of: reward = sum(extraction + sql + insights + push) → normalize batch
# Do: normalize each component independently, then sum
import torch

def gdpo_normalize(component_rewards: dict[str, list[float]]) -> list[float]:
    """Per-component normalization before aggregation (GDPO 2601.05242 §3.1)."""
    normalized = {}
    for task, rewards in component_rewards.items():
        rewards_t = torch.tensor(rewards, dtype=torch.float32)
        std = rewards_t.std()
        if std > 1e-8:
            normalized[task] = ((rewards_t - rewards_t.mean()) / std).tolist()
        else:
            normalized[task] = [0.0] * len(rewards)  # zero-variance group contributes no signal
    # Sum normalized components per sample
    n = len(next(iter(normalized.values())))
    return [sum(normalized[t][i] for t in normalized) for i in range(n)]
```
This requires `commerce_reward_fn` to return per-component rewards (not just the sum), and
the trainer to call `gdpo_normalize` before computing advantages. Moderate effort — requires
a custom trainer subclass.
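For intuition, a toy call with made-up numbers showing what the per-component pass does: each component is z-scored on its own scale before the per-sample sums are taken, so a high-variance component no longer dominates the summed reward.
```python
# Hypothetical batch of 4 completions with two reward components each.
batch_components = {
    "extraction": [0.9, 0.1, 0.8, 0.2],       # high variance
    "sql_qa":     [0.50, 0.52, 0.49, 0.51],   # nearly flat
}
print(gdpo_normalize(batch_components))  # 4 per-sample sums of z-scored components
```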
---
### Change 6: Dynamic Task Weighting (MT-GRPO IWU)
**What:** Track per-task reward improvement rates every N eval steps and increase
sampling probability for stagnating tasks.
```python
import random

class DynamicTaskSampler:
    """
    MT-GRPO §3.2: Improvement-aware Weight Update (IWU).
    Upweights tasks with stagnating reward, downweights converging tasks.
    """

    def __init__(self, tasks, initial_weight=0.25, update_interval=50):
        self.tasks = tasks
        self.weights = {t: initial_weight for t in tasks}
        self.history = {t: [] for t in tasks}
        self.interval = update_interval

    def update(self, step: int, per_task_rewards: dict):
        if step % self.interval != 0 or step == 0:
            return
        for task, reward in per_task_rewards.items():
            self.history[task].append(reward)
            if len(self.history[task]) >= 2:
                improvement = self.history[task][-1] - self.history[task][-2]
                if improvement < 0.01:    # stagnating
                    self.weights[task] = min(0.60, self.weights[task] * 1.3)
                elif improvement > 0.05:  # improving fast
                    self.weights[task] = max(0.10, self.weights[task] * 0.85)
        # Normalize to sum to 1
        total = sum(self.weights.values())
        self.weights = {t: w / total for t, w in self.weights.items()}

    def sample_indices(self, dataset, n_samples: int) -> list[int]:
        """Sample indices with probability proportional to task weights."""
        task_indices = {t: [] for t in self.tasks}
        for i, record in enumerate(dataset):
            user_txt = " ".join(m["content"] for m in record["prompt"] if m["role"] == "user")
            task = _classify_task_type(user_txt)  # project helper from the dataset cells
            task_indices[task].append(i)
        sampled = []
        for task, weight in self.weights.items():
            n = max(1, int(n_samples * weight))
            pool = task_indices.get(task, [])
            if pool:
                sampled.extend(random.sample(pool, min(n, len(pool))))
        return sampled
```
**Why:** SQL Q&A improved +3.8% vs insights +68% under equal weighting. MT-GRPO's
Theorem 1 proves GRPO allocates gradient budget to higher-variance tasks. SQL gets
starved. IWU corrects this by progressively increasing SQL's sampling probability as
its improvement rate drops below threshold.
**Log:** `wandb.log({"sampler/sql_weight": sampler.weights["sql_qa"], ...})` at every
update. You want to see SQL weight increasing from 0.40 (initial) toward 0.60+ if the
model is still stagnating.
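A hedged wiring sketch: `current_step`, `per_task_eval_rewards`, and `train_dataset` are hypothetical names for values the eval callback and dataset cell already hold.
```python
sampler = DynamicTaskSampler(tasks=["extraction", "sql_qa", "insights", "push"])

# At each eval boundary (every 50 steps), feed back the per-task eval rewards,
# e.g. {"extraction": 0.78, "sql_qa": 0.55, "insights": 0.80, "push": 0.74}.
sampler.update(step=current_step, per_task_rewards=per_task_eval_rewards)
wandb.log({f"sampler/{t}_weight": w for t, w in sampler.weights.items()}, step=current_step)

# Before the next training window, draw a task-weighted subset of prompt indices.
next_indices = sampler.sample_indices(train_dataset, n_samples=100)
```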
---
### Change 7: Three Seeds for Reproducibility
**What:** Run V4.2 with `seed ∈ {42, 123, 456}`. Report `mean ± std` across seeds
for all headline metrics.
```python
# In Cell 3:
SEEDS = [42, 123, 456]
CURRENT_SEED = 42 # change per run
# In GRPOConfig and Dataset prep:
seed = CURRENT_SEED
random.seed(seed)
torch.manual_seed(seed)
```
**Why:** Three seeds is the minimum for a credible ML result. A single lucky or unlucky
random initialization can produce ±0.05 variance in eval reward. With three seeds, you
can report `0.645 ± 0.02` instead of just `0.645` — the former is publishable, the
latter is an observation.
**Cost:** 3 × ~12h = ~36h on L4. Do these in parallel if you have access to multiple
GPUs, or sequentially overnight.
**Report format:**
```
| Task | Seed 42 | Seed 123 | Seed 456 | Mean ± 95% CI |
|---|---|---|---|---|
| Extraction | ... | ... | ... | X.XX ± 0.0X |
| SQL Q&A | ... | ... | ... | X.XX ± 0.0X |
| Insights | ... | ... | ... | X.XX ± 0.0X |
| Push | ... | ... | ... | X.XX ± 0.0X |
| **Mean** | ... | ... | ... | **X.XX ± 0.0X** |
```
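A sketch for Cell 16 that fills the table above; the `results_by_seed` values are placeholders for the per-seed, per-task eval means.
```python
import math
import statistics

# Placeholder numbers for illustration only; substitute the three runs' results.
results_by_seed = {
    42:  {"extraction": 0.00, "sql_qa": 0.00, "insights": 0.00, "push": 0.00},
    123: {"extraction": 0.00, "sql_qa": 0.00, "insights": 0.00, "push": 0.00},
    456: {"extraction": 0.00, "sql_qa": 0.00, "insights": 0.00, "push": 0.00},
}

for task in ["extraction", "sql_qa", "insights", "push"]:
    vals = [results_by_seed[s][task] for s in (42, 123, 456)]
    mean = statistics.mean(vals)
    ci = 1.96 * statistics.stdev(vals) / math.sqrt(len(vals))
    print(f"| {task} | " + " | ".join(f"{v:.2f}" for v in vals) + f" | {mean:.2f} ± {ci:.2f} |")
```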
---
### Change 8: Best checkpoint saving
The V4.1 config had `save_total_limit=5` and `save_only_model=True` but no explicit best-checkpoint
logic. `GRPOTrainer` doesn't have a native `load_best_model_at_end` equivalent the way `Trainer` does.
The correct pattern is to track best eval reward in `EvalRewardCallback` and save explicitly when
it improves:
```python
# In EvalRewardCallback.on_step_end, after computing mean_reward:
if improved:  # improved = mean_reward > self.best_reward, computed earlier in the callback
    self.best_reward = mean_reward
    self.best_step = state.global_step
    self.no_improve_count = 0
    # Save best checkpoint explicitly
    best_path = ADAPTER_DIR / "best_checkpoint"
    model.save_pretrained(str(best_path))
    tokenizer.save_pretrained(str(best_path))
    print(f"  ✓ Best checkpoint saved → {best_path} (reward={mean_reward:.4f})")
```
This guarantees the adapter saved in `best_checkpoint/` is always the peak eval reward, regardless of where
training ends or when early stopping fires. Add this to the `EvalRewardCallback` block and the final Cell 15
should load from `best_checkpoint/` rather than the last training step.
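A minimal loading sketch for Cell 15 via plain PEFT (the notebook may instead use Unsloth's loader; `BASE_MODEL_ID` is a placeholder for the model constant from Cell 3):
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

best_path = ADAPTER_DIR / "best_checkpoint"
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL_ID)   # placeholder constant
model = PeftModel.from_pretrained(base, str(best_path))      # attach the best LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(str(best_path))
```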
---
## V4.2 Config Summary
```python
# Changes from V4.1
MAX_STEPS = 1500 # was 600
EVAL_STEPS = 50 # was 20
SAVE_STEPS = 100 # was 50
SEEDS = [42, 123, 456] # run separately
# Structural additions (not hyperparameters)
eval_set = "data/pairs/eval_v2_stratified.jsonl" # 65 fixed samples
reward_sql_qa = reward_sql_qa_v2 # new SQL reward (Change 3)
dynamic_sampler = DynamicTaskSampler(...) # Change 6
# Everything else UNCHANGED from V4.1 (these are validated):
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
LEARNING_RATE = 5e-6
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
```
---
## What to Observe During V4.2
### The four questions V4.2 must answer
**Q1: Does SQL reward improve with the new reward function?**
Watch `eval/sql_qa` trajectory. If it improves +15%+ over V4.1 by step 600, the old
reward was the ceiling. If it still stagnates at ~0.547, the 0.5B model has a genuine
capacity limit on analytical reasoning. Both are valid answers — the important thing is
knowing which.
**Q2: Is the insights regression noise or forgetting?**
With n=15 insights samples, the swing becomes ±0.06 instead of ±0.22. If insights
stabilizes at 0.75-0.85 throughout the extended run, it was noise. If it consistently
drops after step 800, it's catastrophic forgetting — and you need MT-GRPO's constrained
optimization to address it.
**Q3: Does extended training push eval above 0.70?**
V4.1 was still improving at step 500 with eval_best=0.645. 1,500 steps of sustained
training with constant LR and full data coverage should push this further. Target: ≥0.70
mean across all tasks.
**Q4: Are results reproducible across seeds?**
If `mean ± std` across three seeds shows std < 0.03, the result is robust. If std > 0.05,
there's significant initialization sensitivity that needs to be understood before claiming
these numbers.
### WandB metrics to watch in real time
| Metric | Expected | Stop if |
|---|---|---|
| `eval/sql_qa` | Steeper upward slope vs V4.1 | Still flat at step 200 (reward still wrong) |
| `eval/insights` | Stable 0.75-0.85, no big swings | Drops consistently below 0.65 after step 800 |
| `eval/mean_reward` | Continues improving past step 500 | Plateaus at same 0.645 as V4.1 |
| `sampler/sql_weight` | Rising from 0.40 toward 0.55-0.60 | Stays flat (IWU not triggering) |
| `train/reward` vs `eval/mean_reward` | Both rising together | Train keeps rising while eval plateaus past step 800 (overfitting) |
---
## Drawing Conclusions for the Methodology Case
V4.2 closes the 0.5B story. After the runs complete, the project demonstrates:
### What was systematically learned across versions
| Version | Failure identified | Fix | Isolated contribution |
|---|---|---|---|
| V1 | 68% extraction task imbalance → task collapse | Rebalanced to 40/40/10/10 | Established balanced multi-task training |
| V2 | DPO near-no-op (9 gradient steps, flat loss) | Replaced DPO with GRPO | Established GRPO as the correct alignment method |
| V3 | Think model APO anchor + 2,628-token `<think>` overhead | Switched to Instruct model | Isolated architectural constraint |
| V4 | Cosine LR decay to zero by step 130 | Constant LR schedule | +370 productive training steps |
| V4 | JSON parser failed on PT-BR decimals | json-repair + PT normalizer | 3.25× measured extraction improvement |
| V4.1 | Stagnant SQL (+3.8%) | New reward function + dynamic weighting | To be measured in V4.2 |
| V4.1 | Single run, no CIs | 3-seed reproducibility | Statistical credibility |
### The scientific argument
Each version was not "trying something different" — it was **testing a specific hypothesis**
with a controlled change. The key methodological strength is causal attribution: when
eval_best went from 0.476 to 0.645, the V4.1 report decomposed the sources:
- ~0.13 from parser fix (measured at step 20, before GRPO learning)
- ~0.13 from GRPO learning (measured from step 20 to peak)
This is the scientific standard: not just "it improved" but "here is why, with evidence."
V4.2 completes this arc by either showing SQL improvement (proving reward was the ceiling)
or confirming capacity limit (proving 0.5B has an irreducible floor on analytical tasks).
Either outcome is a clean, documentable finding.
### The external benchmark gap
One gap remains for true gold-standard credibility: no external benchmark comparison.
The current eval suite is project-internal. For the methodology case, consider running
the V4.2 final checkpoint against a subset of Portuguese NLP benchmarks from the Tucano2
model card (BLUEX, OAB, ENEM) on the specific knowledge domains your training covered.
Even partial coverage demonstrates the model didn't regress on general Portuguese ability
while improving on domain tasks. This is the "catastrophic forgetting" test at the
benchmark level, complementing the within-run insights stability test.
---
## Notebook Structure for V4.2
```
Cell 1: Dependencies + env vars
Cell 2: GPU + Unsloth + TRL verification gate
Cell 3: Config constants (CURRENT_SEED variable)
Cell 4: Load model + critical generation_config overrides
Cell 5: Token ID verification gate
Cell 6: KV cache diagnostic gate
Cell 7: Reward functions v2 (including reward_sql_qa_v2)
Cell 8: Reward function audit (30-min protocol, ρ > 0.70 gate)
Cell 9: Build stratified eval set (65 samples, fixed)
Cell 10: Dataset preparation + DynamicTaskSampler init
Cell 11: Smoke test (1 step, VRAM check)
Cell 12: Probe run (10 steps, clip_ratio > 0 on 3/10 gate)
Cell 13: W&B init + full training (1,500 steps)
Cell 14: Post-training validation (stratified 65-sample eval)
Cell 15: Save adapter
Cell 16: Results table generation (for reporting)
```
**Run three times, changing only `CURRENT_SEED` in Cell 3.**
---
*V4.2 is the last 0.5B run. Its purpose is not to find more improvement —
it is to know exactly what was found and why, with enough statistical rigor
to say so in writing.*