V4.1 / V4.2 Handoff
Date: 2026-04-27
Context: V4 completed 200 steps, eval_best=0.476 at step 130. Training plateaued due to
LR decay and data starvation (13.5% of one epoch seen). The model learned (+33% over the
SFT baseline); the recipe is validated. V4.1 squeezes more out of the 0.5B model before scaling to 3.7B.
V4.1: Four Changes Only
Change 1: Fix the JSON Parser
What: Replace the regex-based _extract_json() in the reward functions with a robust
parser that handles Portuguese decimal commas and LLM formatting quirks.
Why: The model is writing "sentiment_score": 4,5 (correct PT-BR format). Your parser
calls json.loads(), which fails, and scores the completion near-zero. The model is being
penalized for correct behavior. This is reward misspecification; fixing it is the single
highest-ROI change available, and it comes at zero training cost.
Do this before launching any training run. Run it against data/pairs/eval.jsonl first
and measure how many previously-failing extractions now parse correctly.
import json, re

def _normalize_pt_decimals(s: str) -> str:
    """Convert PT-BR decimals (4,5) to JSON-valid (4.5), only outside quoted strings."""
    result, in_string, escape_next = [], False, False
    i = 0
    while i < len(s):
        c = s[i]
        if escape_next:
            result.append(c); escape_next = False; i += 1; continue
        if c == '\\' and in_string:
            result.append(c); escape_next = True; i += 1; continue
        if c == '"':
            in_string = not in_string; result.append(c); i += 1; continue
        if not in_string:
            m = re.match(r'(\d+),(\d+)', s[i:])
            if m:
                result.append(m.group(1) + '.' + m.group(2))
                i += len(m.group(0)); continue
        result.append(c); i += 1
    return ''.join(result)

def _extract_json(text: str) -> dict | None:
    stripped = re.sub(r'^```(?:json)?\s*|\s*```$', '', text.strip(), flags=re.MULTILINE).strip()
    for attempt in [stripped, _normalize_pt_decimals(stripped)]:
        try:
            result = json.loads(attempt)
            if isinstance(result, dict):
                return result
        except (json.JSONDecodeError, TypeError):
            pass
        # Try finding first {...} block
        match = re.search(r'\{[\s\S]*\}', attempt)
        if match:
            try:
                result = json.loads(match.group())
                if isinstance(result, dict):
                    return result
            except (json.JSONDecodeError, TypeError):
                pass
    return None
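The pre-launch check against data/pairs/eval.jsonl can be a few lines of stdlib Python. This is a sketch: it assumes each line of the file carries the text to parse in a "completion" field, which is a guess about the schema; point it at whatever field actually holds the model or reference output.

import json

def compare_parsers(path="data/pairs/eval.jsonl"):
    # Counts how many records bare json.loads() accepts vs. the robust parser above.
    naive_ok = robust_ok = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            text = record.get("completion", "")  # field name is an assumption
            if not text:
                continue
            total += 1
            try:
                json.loads(text)                 # old behavior: bare json.loads
                naive_ok += 1
            except json.JSONDecodeError:
                pass
            if _extract_json(text) is not None:  # new robust parser
                robust_ok += 1
    print(f"{total} records: naive={naive_ok}, robust={robust_ok}, recovered={robust_ok - naive_ok}")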
Change 2: Triple the Training Steps
What: MAX_STEPS = 200 → 600
Why: V4 trained on 13.5% of the available data. Of ~1,480 training prompts, only ~400 were sampled. The insights (148 prompts) and push (148 prompts) tasks may have contributed fewer than 50 training examples each, which is not enough for meaningful policy updates. 600 steps at 2 prompts/step = 1,200 unique samples, covering ~80% of the training set once.
No other changes needed. Just increase the step count.
Change 3: Replace LR Schedule
What: lr_scheduler_type = "cosine" → "constant_with_warmup"
Why: The V4 cosine schedule decayed to ~1.5×10⁻¹⁰ by step 200. The model spent the last 70 steps (130-200) with an effectively zero learning rate. This is why eval plateaued at step 130: not capacity, not a reward ceiling, not APO resistance. The optimizer simply ran out of learning rate.
# In GRPOConfig:
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05 # 5% warmup (30 steps out of 600)
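If you want to verify the decay claim rather than take it on faith, the standard half-cycle cosine-with-warmup formula (the HF default) can be evaluated by hand with V4's settings (200 steps, warmup_ratio 0.1, peak LR 2e-6). This is a standalone sketch, not the trainer's own scheduler object.

import math

def cosine_lr(step, base_lr=2e-6, total_steps=200, warmup_steps=20):
    # Linear warmup, then half-cycle cosine decay to zero (matches the HF default formula).
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for s in (100, 130, 170, 199):
    print(s, f"{cosine_lr(s):.2e}")
# By step 130 the LR is already around a third of the peak; by the final steps it is
# on the order of 1e-10, i.e. effectively zero updates for the last third of the run.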
Change 4: Raise the Learning Rate
What: LEARNING_RATE = 2e-6 → 5e-6
Why: V4's train/grad_norm = 0.065 was low; the model can tolerate stronger updates.
At 0.5B with LoRA r=16, 5e-6 is within the safe range per Dr. GRPO's Appendix G. Combined
with the constant schedule, this sustains meaningful update magnitude through the full 600-step run.
V4.1 Config Summary
MAX_STEPS = 600 # was 200
LEARNING_RATE = 5e-6 # was 2e-6
lr_scheduler_type = "constant_with_warmup" # was cosine
warmup_ratio = 0.05 # was 0.1 (keep short since constant schedule)
# Everything else UNCHANGED from V4:
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
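For reference, a sketch of how these constants could be wired into TRL's GRPOConfig plus a PEFT LoRA config. Field names assume a recent TRL release (check them against the version pinned in the repo); BATCH_SIZE and GRAD_ACCUM are left out because TRL counts completions, not prompts, in per_device_train_batch_size, and the existing script already handles that mapping.

from trl import GRPOConfig
from peft import LoraConfig

grpo_args = GRPOConfig(
    output_dir="runs/v4_1",                     # hypothetical output path
    max_steps=600,
    learning_rate=5e-6,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    num_generations=16,
    max_completion_length=512,
    temperature=1.0,
    beta=0.0,                                   # KL penalty off
    scale_rewards=False,                        # keep Dr. GRPO-style unscaled advantages
)
lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")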
Also: Add per-task reward breakdown to EvalRewardCallback. This is essential for V4.2
decisions: you need to know which specific tasks are gaining and which are stuck.
# In EvalRewardCallback._run_eval(), track per-task:
task_rewards = {"extraction": [], "sql_qa": [], "insights": [], "push": []}
for record in subset:
    msgs = record["prompt"]
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    user_txt = " ".join(m["content"] for m in msgs if m["role"] == "user")
    task = _classify_task_type(user_txt)
    # ... generate response ...
    r = self.reward_fn([response], [text])[0]
    task_rewards[task].append(r)
per_task = {t: sum(v)/len(v) for t, v in task_rewards.items() if v}
wandb.log({"eval/" + k: v for k, v in per_task.items()}, step=state.global_step)
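_classify_task_type is assumed to exist near the reward functions; if it doesn't, a minimal keyword router over the user turn is enough for per-task bucketing. The keywords below are guesses about the prompt wording, so check them against a handful of real records.

def _classify_task_type(user_text: str) -> str:
    # Crude first-match keyword routing over the user turn; default to extraction.
    t = user_text.lower()
    if "sql" in t or "select" in t:
        return "sql_qa"
    if "push" in t or "notifica" in t:      # catches "notificação" / "notification"
        return "push"
    if "insight" in t or "resumo" in t:     # "resumo" = summary (PT-BR)
        return "insights"
    return "extraction"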
What to Observe During V4.1
Must watch in real time
| Metric | Expected | Stop if |
|---|---|---|
| eval/mean_reward trend | Improving past step 130, continuing to ~step 400 | Plateaus before step 200 (same as V4) |
| eval/extraction | Jumps significantly from V4's ~0.17 baseline | Still below 0.20 after step 100 (parser fix didn't help) |
| train/completion_length | 100-400 tokens | Hits the 512 ceiling (raise MAX_COMPLETION_LENGTH) |
| train/frac_reward_zero_std | < 0.2 | Sustained > 0.5 |
| train/grad_norm | 0.05-0.5 | Spikes > 2.0 (LR too high) |
| train/kl | Any value (the KL term is inactive at β=0) | Ignore entirely |
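The "Stop if" column is easy to forget mid-run, so it can be encoded as a small helper called from the eval callback. Threshold values are copied from the table; the metric keys assume the names already logged to W&B, and the plateau check is omitted because it needs the full eval history, not a single step.

def check_stop_conditions(step: int, m: dict) -> list[str]:
    # m maps metric name -> latest value, e.g. {"train/grad_norm": 0.12, ...}
    warnings = []
    if step > 100 and m.get("eval/extraction", 1.0) < 0.20:
        warnings.append("eval/extraction still below 0.20 after step 100 (parser fix didn't help)")
    if m.get("train/completion_length", 0) >= 512:
        warnings.append("completion_length at the 512 ceiling; raise MAX_COMPLETION_LENGTH")
    if m.get("train/frac_reward_zero_std", 0.0) > 0.5:
        warnings.append("frac_reward_zero_std above 0.5 (many zero-variance reward groups)")
    if m.get("train/grad_norm", 0.0) > 2.0:
        warnings.append("grad_norm spiked above 2.0 (LR may be too high)")
    return warnings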
Questions to answer from the run
1. Did the parser fix change extraction reward? Compare eval/extraction at step 10 (V4.1) against V4's calibration baseline (~0.17). If it jumps to 0.35+, the parser was the bottleneck. If it stays at ~0.17, the model genuinely can't produce valid JSON at 0.5B.
2. Did eval continue improving past step 130? The key question. If yes, the V4 plateau was entirely the LR schedule, not a capacity limit. If no (it plateaus at the same ~0.476), the 0.5B model has a real ceiling.
3. Which task has the lowest per-task reward at step 600? This determines the V4.2 priority. Whichever task scores lowest is either (a) reward function too coarse, (b) 0.5B capacity insufficient, or (c) underrepresented in training.
4. What is eval/mean_reward at step 600? Target: ≥ 0.55. If reached, scale to 3.7B. If stuck at ~0.48, proceed to V4.2.
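Questions 1 and 2 can be answered from the W&B public API instead of eyeballing dashboards; the entity/project/run IDs below are placeholders.

import wandb

api = wandb.Api()
v4 = api.run("ENTITY/PROJECT/V4_RUN_ID")        # placeholder run path
v41 = api.run("ENTITY/PROJECT/V4_1_RUN_ID")     # placeholder run path

keys = ["eval/mean_reward", "eval/extraction"]
h4 = v4.history(keys=keys)
h41 = v41.history(keys=keys)

print("V4   best eval/mean_reward:", h4["eval/mean_reward"].max())
print("V4.1 best eval/mean_reward:", h41["eval/mean_reward"].max())
print(h41.tail(10))                             # did it keep improving past step 130?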
V4.2: Conditional on V4.1 Results
V4.2 depends entirely on what V4.1 tells you. Three scenarios:
Scenario A: V4.1 reaches ≥ 0.55 eval reward
Conclusion: 0.5B-Instruct is maximized. The recipe is validated.
V4.2 = Scale to 3.7B-Instruct. Apply the V4.1 config with these adjustments:
MODEL_ID = "Polygl0t/Tucano2-qwen-3.7B-Instruct"
NUM_GENERATIONS = 8 # VRAM constraint at 3.7B (was 16)
MAX_COMPLETION_LENGTH = 1024 # 3.7B can afford richer output (was 512)
LEARNING_RATE = 2e-6 # Larger model = smaller LR (was 5e-6)
MAX_STEPS = 400 # Fewer steps needed, larger model learns faster
MAX_SEQ_LENGTH = 2048 # Unchanged
Same reward functions, same schedule, same everything else. The qualitative findings transfer; the numerical hyperparameters don't, so use the values above.
Scenario B: V4.1 plateaus at ~0.476 (same as V4), eval/extraction still ≈ 0.17
Conclusion: The parser fix didn't help; the model can't produce valid JSON at 0.5B.
V4.2 = Switch base model.
Use Polygl0t/Tucano2-qwen-0.5B-Base (no APO, no SFT) with a minimal SFT warm-up first:
# Step 1: 1-epoch SFT on extraction pairs only (~590 pairs)
# 30 minutes on L4, teaches JSON output format
MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
# SFT with loss on assistant turns only, 1 epoch
# Step 2: Same V4.1 GRPO config on top of SFT checkpoint
# If extraction still fails after the SFT warm-up → 0.5B genuinely can't do structured output
# → skip 0.5B entirely, go directly to 3.7B-Instruct
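A sketch of Step 1 in TRL. This follows the long-standing SFTTrainer + DataCollatorForCompletionOnlyLM pattern for assistant-only loss; exact argument names move between TRL versions, the data path is a guess, and the response_template assumes a Qwen-style chat template, so verify all three against the repo before running.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM

MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# ~590 extraction pairs, chat-formatted; the file name is an assumption.
ds = load_dataset("json", data_files="data/pairs/train_extraction.jsonl", split="train")

collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant",   # assumes Qwen-style chat markers
    tokenizer=tokenizer,
)
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="runs/sft_warmup", num_train_epochs=1, max_seq_length=2048),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
# Step 2: launch the V4.1 GRPO config with runs/sft_warmup as the starting checkpoint.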
Scenario C: V4.1 eval improves but one task type lags far behind others
Conclusion: Task imbalance or reward function ceiling on the lagging task.
V4.2 = Targeted fix for the weakest task. Identify it from eval/extraction,
eval/sql_qa, eval/insights, eval/push breakdown, then apply ONE of:
If extraction lags (reward < 0.25 at step 600):
Reward function ceiling. Add semantic field-value scoring:
# In reward_extraction: add bonus for correct field VALUES, not just field PRESENCE
if data.get("sentiment") in VALID_SENTIMENTS: score += 0.05 # was already there
# Add: exact-match bonus against reference answer if available
# Add: partial credit for correct complaint_category even if sentiment wrong
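A sketch of what the field-value bonus could look like inside reward_extraction. VALID_SENTIMENTS and the reference-answer plumbing are assumed to already exist; the field names follow the extraction schema mentioned in this doc, and the bonus weights are placeholders to tune.

def _field_value_bonus(data: dict, reference: dict | None) -> float:
    # Extra credit for correct field VALUES, graded per field so one wrong field
    # (e.g. sentiment) no longer wipes out credit for the others.
    bonus = 0.0
    if not reference:
        return bonus
    if data.get("sentiment") == reference.get("sentiment"):
        bonus += 0.10
    if data.get("complaint_category") == reference.get("complaint_category"):
        bonus += 0.10
    pred, ref = data.get("sentiment_score"), reference.get("sentiment_score")
    if isinstance(pred, (int, float)) and isinstance(ref, (int, float)):
        # Partial credit that decays with the numeric distance from the reference score.
        bonus += max(0.0, 0.10 - 0.05 * abs(float(pred) - float(ref)))
    return bonus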
If sql_qa lags (reward < 0.20 at step 600):
Capacity limit at 0.5B for analytical tasks. Accept this and plan 3.7B for sql_qa.
Keep 0.5B for extraction+push only, route sql_qa+insights to 3.7B in production.
If insights lags (reward < 0.20 at step 600):
Most likely data starvation (only ~148 insights prompts in training set). Generate 200 more
insights pairs from commerce.db before V4.2:
# Quick synthetic expansion using existing generate_pairs.py logic
# Target: 350 total insights pairs (was 148)
# Source: random sample from orders_enriched WHERE sentiment='negative'
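A sketch of the sampling side, assuming commerce.db is SQLite and orders_enriched has the sentiment column used above; the output path, the order_id column, and the prompt template are placeholders, so reuse the real prompt logic from generate_pairs.py.

import json
import sqlite3

conn = sqlite3.connect("commerce.db")
conn.row_factory = sqlite3.Row
rows = conn.execute(
    "SELECT * FROM orders_enriched WHERE sentiment = 'negative' "
    "ORDER BY RANDOM() LIMIT 200"
).fetchall()

with open("data/pairs/insights_extra.jsonl", "w", encoding="utf-8") as f:  # path is an assumption
    for row in rows:
        record = dict(row)
        # Placeholder prompt; swap in the real insights prompt template from generate_pairs.py.
        prompt = f"Gere um insight sobre o pedido {record.get('order_id', '?')}."
        f.write(json.dumps({"prompt": prompt, "source": record}, ensure_ascii=False) + "\n")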
Decision Tree
Run V4.1 (600 steps, 5e-6 LR, constant schedule, parser fix)
 │
 ├─ eval ≥ 0.55 at step 600?
 │    YES → Scenario A: scale to 3.7B-Instruct with the V4.1 config
 │
 ├─ eval plateaus at ~0.476 AND extraction still ≈ 0.17?
 │    YES → Scenario B: switch to 0.5B-Base with SFT warm-up
 │
 └─ eval improves but one task type consistently < 0.20?
      YES → Scenario C: targeted fix for the weakest task
             sql_qa weak     → accept, plan 3.7B
             insights weak   → generate more pairs
             extraction weak → refine reward function
V4 validated the recipe. V4.1 exhausts the 0.5B model before spending compute on 3.7B.