# V4.1 / V4.2 Handoff

**Date:** 2026-04-27
**Context:** V4 completed 200 steps, eval_best=0.476 at step 130. Training plateaued due to
LR decay and data starvation (13.5% of one epoch seen). The model learned (+33% over the SFT
baseline). The recipe is validated. V4.1 squeezes the 0.5B before scaling to 3.7B.

---

## V4.1: Four Changes Only

### Change 1: Fix the JSON Parser

**What:** Replace the regex-based `_extract_json()` in the reward functions with a robust
parser that handles Portuguese decimal commas and LLM formatting quirks.

**Why:** The model is writing `"sentiment_score": 4,5` (correct PT-BR format). Your parser
calls `json.loads()`, which fails, and scores the completion near zero. The model is being
penalized for correct behavior. This is reward misspecification, and fixing it is the single
highest-ROI change available at zero training cost.

**Do this before launching any training run.** Run it against `data/pairs/eval.jsonl` first
and measure how many previously-failing extractions now parse correctly.

```python
import json, re

def _normalize_pt_decimals(s: str) -> str:
    """Convert PT-BR decimals (4,5) to JSON-valid (4.5), only outside quoted strings."""
    result, in_string, escape_next = [], False, False
    i = 0
    while i < len(s):
        c = s[i]
        if escape_next:
            result.append(c); escape_next = False; i += 1; continue
        if c == '\\' and in_string:
            result.append(c); escape_next = True; i += 1; continue
        if c == '"':
            in_string = not in_string; result.append(c); i += 1; continue
        if not in_string:
            # Rewrite digit,digit as digit.digit only when outside a quoted string
            m = re.match(r'(\d+),(\d+)', s[i:])
            if m:
                result.append(m.group(1) + '.' + m.group(2))
                i += len(m.group(0)); continue
        result.append(c); i += 1
    return ''.join(result)

def _extract_json(text: str) -> dict | None:
    # Strip Markdown code fences the model sometimes wraps around its JSON
    stripped = re.sub(r'^```(?:json)?\s*|\s*```$', '', text.strip(), flags=re.MULTILINE).strip()
    # First pass: raw text; second pass: with PT-BR decimal commas normalized
    for attempt in [stripped, _normalize_pt_decimals(stripped)]:
        try:
            result = json.loads(attempt)
            if isinstance(result, dict):
                return result
        except (json.JSONDecodeError, TypeError):
            pass
        # Try finding first {...} block
        match = re.search(r'\{[\s\S]*\}', attempt)
        if match:
            try:
                result = json.loads(match.group())
                if isinstance(result, dict):
                    return result
            except (json.JSONDecodeError, TypeError):
                pass
    return None
```
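
Before running it against `data/pairs/eval.jsonl`, a quick sanity check on the PT-BR cases is cheap. A minimal sketch using the functions above; the sample completions are illustrative stand-ins, not real records from the eval set:

```python
# Quick sanity check for the parser fix. These strings are illustrative stand-ins,
# not real completions from data/pairs/eval.jsonl.
samples = [
    '{"sentiment_score": 4,5, "sentiment": "positivo"}',                # PT-BR decimal comma
    '```json\n{"sentiment_score": 3.0, "sentiment": "neutro"}\n```',    # fenced output
    'Claro! Aqui está o JSON: {"sentiment_score": 1,0}',                # JSON embedded in prose
]

for s in samples:
    print(_extract_json(s))
# Expected: three dicts (4.5 / 3.0 / 1.0 as floats), no None results.
```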

---

### Change 2: Triple the Training Steps

**What:** `MAX_STEPS = 200 → 600`

**Why:** V4 trained on 13.5% of the available data. Of ~1,480 training prompts, only ~400 were
sampled. The insights (148 prompts) and push (148 prompts) tasks may have contributed fewer
than 50 training examples each, which is not enough for meaningful policy updates. 600 steps at
2 prompts/step = 1,200 unique samples, covering ~80% of the training set once.

No other changes needed. Just increase the step count; the coverage arithmetic is sketched below.
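
A back-of-the-envelope check of that coverage claim, assuming the figures quoted above (~1,480 training prompts, 2 unique prompts per optimizer step) and ignoring repeat sampling:

```python
# Rough data-coverage check for the step-count change. Assumes ~1,480 training prompts
# and 2 unique prompts per optimizer step (BATCH_SIZE = 2, GRAD_ACCUM = 1), as stated above.
train_prompts = 1480
prompts_per_step = 2
max_steps = 600

sampled = max_steps * prompts_per_step
print(f"{sampled} prompts sampled, ~{sampled / train_prompts:.0%} of the training set")
# -> 1200 prompts sampled, ~81% of the training set
```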

---

### Change 3: Replace the LR Schedule

**What:** `lr_scheduler_type = "cosine" → "constant_with_warmup"`

**Why:** The V4 cosine schedule had decayed to ~1.5×10⁻¹⁰ by step 200. The model spent the
last 70 steps (130–200) with an effectively zero learning rate. This is why eval plateaued
at step 130: not capacity, not a reward ceiling, not APO resistance. The updates simply became
too small to move the policy.

```python
# In GRPOConfig:
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05   # 5% warmup (30 steps out of 600)
```
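
If you want to confirm the decay claim before changing anything, the V4 schedule can be reproduced offline. A small sketch assuming the standard Hugging Face cosine-with-warmup formula and V4's 10% warmup; the exact logged values depend on the trainer's internals, so treat the output as indicative:

```python
import math

def cosine_lr(step: int, base_lr: float = 2e-6, max_steps: int = 200,
              warmup_ratio: float = 0.1) -> float:
    """Standard cosine-with-warmup schedule: linear warmup, then cosine decay to 0."""
    warmup_steps = int(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (100, 130, 170, 200):
    print(step, f"{cosine_lr(step):.2e}")
# By step 130 the LR is already roughly a third of 2e-6, and it reaches ~0 at step 200.
```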

---

### Change 4: Raise the Learning Rate

**What:** `LEARNING_RATE = 2e-6 → 5e-6`

**Why:** V4's `train/grad_norm = 0.065` was low, so the model can tolerate stronger updates.
At 0.5B with LoRA r=16, 5e-6 is within the safe range per Dr. GRPO's Appendix G. Combined
with the constant schedule, this sustains meaningful update sizes throughout the 600-step run.

---

### V4.1 Config Summary

```python
MAX_STEPS = 600                              # was 200
LEARNING_RATE = 5e-6                         # was 2e-6
lr_scheduler_type = "constant_with_warmup"   # was cosine
warmup_ratio = 0.05                          # was 0.1 (keep warmup short with a constant schedule)

# Everything else UNCHANGED from V4:
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
```

**Also:** Add per-task reward breakdown to `EvalRewardCallback`. This is essential for V4.2
decisions: you need to know which specific tasks are gaining and which are stuck.

```python
# In EvalRewardCallback._run_eval(), track per-task:
task_rewards = {"extraction": [], "sql_qa": [], "insights": [], "push": []}
for record in subset:
    msgs = record["prompt"]
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    user_txt = " ".join(m["content"] for m in msgs if m["role"] == "user")
    task = _classify_task_type(user_txt)
    # ... generate response ...
    r = self.reward_fn([response], [text])[0]
    task_rewards[task].append(r)

per_task = {t: sum(v) / len(v) for t, v in task_rewards.items() if v}
wandb.log({"eval/" + k: v for k, v in per_task.items()}, step=state.global_step)
```
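
The snippet assumes `_classify_task_type()` already exists in your codebase. If it doesn't, a minimal keyword heuristic over the user text is enough for bucketing eval rewards. The keywords below are assumptions, not taken from the real prompt templates; replace them with phrases that actually appear in your four prompt types:

```python
# Hypothetical fallback classifier. The keyword lists are guesses; adjust them to match
# how the extraction / sql_qa / insights / push prompts are actually worded.
def _classify_task_type(user_text: str) -> str:
    t = user_text.lower()
    if "json" in t or "extraia" in t or "extract" in t:
        return "extraction"
    if "sql" in t or "select " in t:
        return "sql_qa"
    if "insight" in t or "resumo" in t:
        return "insights"
    return "push"  # default bucket: push-notification prompts
```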

---

## What to Observe During V4.1

### Must watch in real time

| Metric | Expected | Stop if |
|---|---|---|
| `eval/mean_reward` trend | Improving past step 130, continuing to ~step 400 | Plateaus before step 200 (same as V4) |
| `eval/extraction` | Jumps significantly from V4's ~0.17 baseline | Still below 0.20 after step 100 (parser fix didn't help) |
| `train/completion_length` | 100–400 tokens | Hits the 512 ceiling (need to raise MAX_COMPLETION_LENGTH) |
| `train/frac_reward_zero_std` | < 0.2 | Sustained > 0.5 |
| `train/grad_norm` | 0.05–0.5 | Spikes > 2.0 (LR too high) |
| `train/kl` | Any value (KL is meaningless at β = 0) | Ignore entirely |

### Questions to answer from the run

1. **Did the parser fix change the extraction reward?**
   Compare `eval/extraction` at step 10 (V4.1) against V4's calibration baseline (~0.17). If it
   jumps to 0.35+, the parser was the bottleneck. If it stays at ~0.17, the model genuinely
   can't produce valid JSON at 0.5B.

2. **Did eval continue improving past step 130?**
   The key question. If yes, the V4 plateau was entirely the LR schedule and not a capacity
   limit. If no (it plateaus at the same ~0.476), the 0.5B model has a real ceiling.

3. **Which task has the lowest per-task reward at step 600?**
   This determines the V4.2 priority. Whichever task scores lowest is either (a) limited by a
   reward function that is too coarse, (b) beyond 0.5B capacity, or (c) underrepresented in training.

4. **What is `eval/mean_reward` at step 600?**
   Target: ≥ 0.55. If reached, scale to 3.7B. If stuck at ~0.48, proceed to V4.2.

---

## V4.2: Conditional on V4.1 Results

V4.2 depends entirely on what V4.1 tells you. Three scenarios:

---

### Scenario A: V4.1 reaches ≥ 0.55 eval reward

**Conclusion:** 0.5B-Instruct is maximized. The recipe is validated.
**V4.2 = Scale to 3.7B-Instruct.** Apply the V4.1 config with these adjustments:

```python
MODEL_ID = "Polygl0t/Tucano2-qwen-3.7B-Instruct"
NUM_GENERATIONS = 8           # VRAM constraint at 3.7B (was 16)
MAX_COMPLETION_LENGTH = 1024  # 3.7B can afford richer output (was 512)
LEARNING_RATE = 2e-6          # larger model = smaller LR (was 5e-6)
MAX_STEPS = 400               # fewer steps needed; the larger model learns faster
MAX_SEQ_LENGTH = 2048         # unchanged
```

Same reward functions, same schedule, same everything else. The qualitative findings transfer;
the numerical hyperparameters don't, so use the values above.

---

### Scenario B: V4.1 plateaus at ~0.476 (same as V4), eval/extraction still ≈ 0.17

**Conclusion:** The parser fix didn't help, which means the model can't produce valid JSON at 0.5B.
**V4.2 = Switch base model.**

Use `Polygl0t/Tucano2-qwen-0.5B-Base` (no APO, no SFT) with a minimal SFT warm-up first:

```python
# Step 1: 1-epoch SFT on extraction pairs only (~590 pairs)
# 30 minutes on L4, teaches JSON output format
MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
# SFT with loss on assistant turns only, 1 epoch

# Step 2: same V4.1 GRPO config on top of the SFT checkpoint
# If extraction still fails after the SFT warm-up → 0.5B genuinely can't do structured output
# → skip 0.5B entirely, go directly to 3.7B-Instruct
```
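
For Step 1, a minimal warm-up sketch using TRL's `SFTTrainer` with loss on the assistant turn only. This is a sketch under assumptions, not the repo's actual script: it assumes the training pairs in `data/pairs/train.jsonl` carry chat `prompt` messages plus a reference `completion` string, a TRL version that still exports `DataCollatorForCompletionOnlyLM`, and a Qwen-style chat template where `<|im_start|>assistant` opens the reply. Adjust field names, the response template, and the SFT learning rate to your setup.

```python
# Hedged Step 1 sketch: 1-epoch SFT on extraction pairs with assistant-only loss.
# Field names, the response template, and the LR below are assumptions -- verify them.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer

MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def is_extraction(rec):
    # Assumption: extraction prompts are the ones that ask for JSON output.
    user_txt = " ".join(m["content"] for m in rec["prompt"] if m["role"] == "user")
    return "json" in user_txt.lower()

def to_text(rec):
    msgs = rec["prompt"] + [{"role": "assistant", "content": rec["completion"]}]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

ds = load_dataset("json", data_files="data/pairs/train.jsonl", split="train")
ds = ds.filter(is_extraction).map(to_text, remove_columns=ds.column_names)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="ckpts/sft_warmup_0.5b_base",
        num_train_epochs=1,                 # one epoch is enough to teach the output format
        per_device_train_batch_size=4,
        learning_rate=1e-5,                 # assumption: a typical small SFT LR, tune as needed
        max_seq_length=2048,
        dataset_text_field="text",
    ),
    train_dataset=ds,
    # Mask the prompt so the loss covers the assistant turn only:
    data_collator=DataCollatorForCompletionOnlyLM(
        response_template="<|im_start|>assistant", tokenizer=tokenizer
    ),
)
trainer.train()
```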

---

### Scenario C: V4.1 eval improves but one task type lags far behind the others

**Conclusion:** Task imbalance, or a reward-function ceiling on the lagging task.
**V4.2 = Targeted fix for the weakest task.** Identify it from the `eval/extraction`,
`eval/sql_qa`, `eval/insights`, `eval/push` breakdown, then apply ONE of the following:

**If extraction lags (reward < 0.25 at step 600):**
Reward-function ceiling. Add semantic field-value scoring:
```python
# In reward_extraction: add bonus for correct field VALUES, not just field PRESENCE
if data.get("sentiment") in VALID_SENTIMENTS: score += 0.05  # was already there
# Add: exact-match bonus against reference answer if available
# Add: partial credit for correct complaint_category even if sentiment wrong
```
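
A sketch of what those additions could look like inside `reward_extraction`, assuming the eval pairs carry a parsed reference dict. The field names (`sentiment`, `sentiment_score`, `complaint_category`) follow the examples in this document, but the `ref` argument and the weights are assumptions, not tuned values:

```python
# Hypothetical value-level bonus for reward_extraction. `data` is the parsed model
# output; `ref` is the parsed reference answer for the same prompt, if you store one.
def _value_bonus(data: dict, ref: dict | None) -> float:
    bonus = 0.0
    if not ref:
        return bonus
    # Partial credit for complaint_category even if sentiment is wrong.
    if data.get("complaint_category") == ref.get("complaint_category"):
        bonus += 0.10
    # Exact-match bonus on sentiment, plus a tolerance band on the numeric score.
    if data.get("sentiment") == ref.get("sentiment"):
        bonus += 0.10
    try:
        if abs(float(data.get("sentiment_score")) - float(ref.get("sentiment_score"))) <= 0.5:
            bonus += 0.05
    except (TypeError, ValueError):
        pass
    return bonus
```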

**If sql_qa lags (reward < 0.20 at step 600):**
Capacity limit at 0.5B for analytical tasks. Accept this and plan on 3.7B for sql_qa.
Keep 0.5B for extraction + push only, and route sql_qa + insights to 3.7B in production.

**If insights lags (reward < 0.20 at step 600):**
Most likely data starvation (only ~148 insights prompts in the training set). Generate 200 more
insights pairs from `commerce.db` before V4.2, along the lines of the sketch below:

```bash
# Quick synthetic expansion using the existing generate_pairs.py logic
# Target: 350 total insights pairs (was 148)
# Source: random sample from orders_enriched WHERE sentiment='negative'
```
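
A minimal sampling sketch for that expansion, assuming `commerce.db` is SQLite, the table and column names match the comment above, and the actual pair formatting stays in `generate_pairs.py` (`build_insights_pair` below is a hypothetical stand-in for that existing logic, and the output path is illustrative):

```python
import json
import sqlite3

# Assumptions: commerce.db is SQLite, orders_enriched has a sentiment column, and
# build_insights_pair() is a stand-in for the real pair-building code in generate_pairs.py.
conn = sqlite3.connect("commerce.db")
conn.row_factory = sqlite3.Row
rows = conn.execute(
    "SELECT * FROM orders_enriched WHERE sentiment = 'negative' ORDER BY RANDOM() LIMIT 200"
).fetchall()

with open("data/pairs/insights_extra.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        pair = build_insights_pair(dict(row))  # hypothetical: reuse generate_pairs.py logic
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```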

---

## Decision Tree

```
Run V4.1 (600 steps, 5e-6 LR, constant schedule, parser fix)
       │
       ├─ eval ≥ 0.55 at step 600?
       │        YES → Scenario A: Scale to 3.7B-Instruct with the V4.1 config
       │
       ├─ eval plateaus at ~0.476 AND extraction still ≈ 0.17?
       │        YES → Scenario B: Switch to 0.5B-Base with an SFT warm-up
       │
       └─ eval improves but one task type consistently < 0.20?
                YES → Scenario C: Targeted fix for the weakest task
                      sql_qa weak → accept, plan 3.7B
                      insights weak → generate more pairs
                      extraction weak → refine the reward function
```

---

*V4 validated the recipe. V4.1 exhausts 0.5B before spending 3.7B compute.*