V4.1 / V4.2 Handoff
Date: 2026-04-27
Context: V4 completed 200 steps, eval_best=0.476 at step 130. Training plateaued due to
LR decay and data starvation (13.5% of one epoch seen). The model learned (+33% over the
SFT baseline); the recipe is validated. V4.1 squeezes more out of the 0.5B model before scaling to 3.7B.
V4.1: Four Changes Only
Change 1: Fix the JSON Parser
What: Replace the regex-based _extract_json() in the reward functions with a robust
parser that handles Portuguese decimal commas and LLM formatting quirks.
Why: The model is writing "sentiment_score": 4,5 (correct PT-BR format). Your parser
calls json.loads(), which fails, and scores the completion near-zero. The model is being
penalized for correct behavior. This is reward misspecification; fixing it is the single
highest-ROI change available, and it comes at zero training cost.
Do this before launching any training run. Run it against data/pairs/eval.jsonl first
and measure how many previously-failing extractions now parse correctly.
import json, re

def _normalize_pt_decimals(s: str) -> str:
    """Convert PT-BR decimals (4,5) to JSON-valid (4.5), only outside quoted strings."""
    result, in_string, escape_next = [], False, False
    i = 0
    while i < len(s):
        c = s[i]
        if escape_next:
            result.append(c); escape_next = False; i += 1; continue
        if c == '\\' and in_string:
            result.append(c); escape_next = True; i += 1; continue
        if c == '"':
            in_string = not in_string; result.append(c); i += 1; continue
        if not in_string:
            m = re.match(r'(\d+),(\d+)', s[i:])
            if m:
                result.append(m.group(1) + '.' + m.group(2))
                i += len(m.group(0)); continue
        result.append(c); i += 1
    return ''.join(result)

def _extract_json(text: str) -> dict | None:
    stripped = re.sub(r'^```(?:json)?\s*|\s*```$', '', text.strip(), flags=re.MULTILINE).strip()
    for attempt in [stripped, _normalize_pt_decimals(stripped)]:
        try:
            result = json.loads(attempt)
            if isinstance(result, dict):
                return result
        except (json.JSONDecodeError, TypeError):
            pass
        # Try finding first {...} block
        match = re.search(r'\{[\s\S]*\}', attempt)
        if match:
            try:
                result = json.loads(match.group())
                if isinstance(result, dict):
                    return result
            except (json.JSONDecodeError, TypeError):
                pass
    return None
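The pre-launch check against data/pairs/eval.jsonl can be a few lines of stdlib Python. This is a sketch: it assumes each line of the file carries the text to parse in a "completion" field, which is a guess about the schema; point it at whatever field actually holds the model or reference output.

import json

def compare_parsers(path="data/pairs/eval.jsonl"):
    # Counts how many records bare json.loads() accepts vs. the robust parser above.
    naive_ok = robust_ok = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            text = record.get("completion", "")  # field name is an assumption
            if not text:
                continue
            total += 1
            try:
                json.loads(text)                 # old behavior: bare json.loads
                naive_ok += 1
            except json.JSONDecodeError:
                pass
            if _extract_json(text) is not None:  # new robust parser
                robust_ok += 1
    print(f"{total} records: naive={naive_ok}, robust={robust_ok}, recovered={robust_ok - naive_ok}")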
Change 2: Triple the Training Steps
What: MAX_STEPS = 200 → 600
Why: V4 trained on 13.5% of the available data. Of ~1,480 training prompts, only ~400 were sampled. The insights (148 prompts) and push (148 prompts) tasks may have contributed fewer than 50 training examples each, which is not enough for meaningful policy updates. 600 steps at 2 prompts/step = 1,200 unique samples, covering ~80% of the training set once.
No other changes needed. Just increase the step count.
Change 3: Replace LR Schedule
What: lr_scheduler_type = "cosine" → "constant_with_warmup"
Why: The V4 cosine schedule decayed to ~1.5×10⁻¹⁰ by step 200. The model spent the last 70 steps (130-200) with an effectively zero learning rate. This is why eval plateaued at step 130: not capacity, not a reward ceiling, not APO resistance. The optimizer simply ran out of learning rate.
# In GRPOConfig:
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05 # 5% warmup (30 steps out of 600)
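If you want to verify the decay claim rather than take it on faith, the standard half-cycle cosine-with-warmup formula (the HF default) can be evaluated by hand with V4's settings (200 steps, warmup_ratio 0.1, peak LR 2e-6). This is a standalone sketch, not the trainer's own scheduler object.

import math

def cosine_lr(step, base_lr=2e-6, total_steps=200, warmup_steps=20):
    # Linear warmup, then half-cycle cosine decay to zero (matches the HF default formula).
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for s in (100, 130, 170, 199):
    print(s, f"{cosine_lr(s):.2e}")
# By step 130 the LR is already around a third of the peak; by the final steps it is
# on the order of 1e-10, i.e. effectively zero updates for the last third of the run.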
Change 4: Raise the Learning Rate
What: LEARNING_RATE = 2e-6 → 5e-6
Why: V4's train/grad_norm = 0.065 was low; the model can tolerate stronger updates.
At 0.5B with LoRA r=16, 5e-6 is within the safe range per Dr. GRPO's Appendix G. Combined
with the constant schedule, this sustains meaningful update magnitude through the full 600-step run.
V4.1 Config Summary
MAX_STEPS = 600 # was 200
LEARNING_RATE = 5e-6 # was 2e-6
lr_scheduler_type = "constant_with_warmup" # was cosine
warmup_ratio = 0.05 # was 0.1 (keep short since constant schedule)
# Everything else UNCHANGED from V4:
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
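For reference, a sketch of how these constants could be wired into TRL's GRPOConfig plus a PEFT LoRA config. Field names assume a recent TRL release (check them against the version pinned in the repo); BATCH_SIZE and GRAD_ACCUM are left out because TRL counts completions, not prompts, in per_device_train_batch_size, and the existing script already handles that mapping.

from trl import GRPOConfig
from peft import LoraConfig

grpo_args = GRPOConfig(
    output_dir="runs/v4_1",                     # hypothetical output path
    max_steps=600,
    learning_rate=5e-6,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    num_generations=16,
    max_completion_length=512,
    temperature=1.0,
    beta=0.0,                                   # KL penalty off
    scale_rewards=False,                        # keep Dr. GRPO-style unscaled advantages
)
lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")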
Also: Add per-task reward breakdown to EvalRewardCallback. This is essential for V4.2
decisions: you need to know which specific tasks are gaining and which are stuck.
# In EvalRewardCallback._run_eval(), track per-task:
task_rewards = {"extraction": [], "sql_qa": [], "insights": [], "push": []}
for record in subset:
    msgs = record["prompt"]
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    user_txt = " ".join(m["content"] for m in msgs if m["role"] == "user")
    task = _classify_task_type(user_txt)
    # ... generate response ...
    r = self.reward_fn([response], [text])[0]
    task_rewards[task].append(r)
per_task = {t: sum(v)/len(v) for t, v in task_rewards.items() if v}
wandb.log({"eval/" + k: v for k, v in per_task.items()}, step=state.global_step)
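_classify_task_type is assumed to exist near the reward functions; if it doesn't, a minimal keyword router over the user turn is enough for per-task bucketing. The keywords below are guesses about the prompt wording, so check them against a handful of real records.

def _classify_task_type(user_text: str) -> str:
    # Crude first-match keyword routing over the user turn; default to extraction.
    t = user_text.lower()
    if "sql" in t or "select" in t:
        return "sql_qa"
    if "push" in t or "notifica" in t:      # catches "notificação" / "notification"
        return "push"
    if "insight" in t or "resumo" in t:     # "resumo" = summary (PT-BR)
        return "insights"
    return "extraction"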
What to Observe During V4.1
Must watch in real time
| Metric | Expected | Stop if |
|---|---|---|
| eval/mean_reward trend | Improving past step 130, continuing to ~step 400 | Plateaus before step 200 (same as V4) |
| eval/extraction | Jumps significantly from V4's ~0.17 baseline | Still below 0.20 after step 100 (parser fix didn't help) |
| train/completion_length | 100-400 tokens | Hits the 512 ceiling (raise MAX_COMPLETION_LENGTH) |
| train/frac_reward_zero_std | < 0.2 | Sustained > 0.5 |
| train/grad_norm | 0.05-0.5 | Spikes > 2.0 (LR too high) |
| train/kl | Any value (the KL term is inactive at β=0) | Ignore entirely |
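The "Stop if" column is easy to forget mid-run, so it can be encoded as a small helper called from the eval callback. Threshold values are copied from the table; the metric keys assume the names already logged to W&B, and the plateau check is omitted because it needs the full eval history, not a single step.

def check_stop_conditions(step: int, m: dict) -> list[str]:
    # m maps metric name -> latest value, e.g. {"train/grad_norm": 0.12, ...}
    warnings = []
    if step > 100 and m.get("eval/extraction", 1.0) < 0.20:
        warnings.append("eval/extraction still below 0.20 after step 100 (parser fix didn't help)")
    if m.get("train/completion_length", 0) >= 512:
        warnings.append("completion_length at the 512 ceiling; raise MAX_COMPLETION_LENGTH")
    if m.get("train/frac_reward_zero_std", 0.0) > 0.5:
        warnings.append("frac_reward_zero_std above 0.5 (many zero-variance reward groups)")
    if m.get("train/grad_norm", 0.0) > 2.0:
        warnings.append("grad_norm spiked above 2.0 (LR may be too high)")
    return warnings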
Questions to answer from the run
1. Did the parser fix change extraction reward? Compare eval/extraction at step 10 (V4.1) against V4's calibration baseline (~0.17). If it jumps to 0.35+, the parser was the bottleneck. If it stays at ~0.17, the model genuinely can't produce valid JSON at 0.5B.
2. Did eval continue improving past step 130? The key question. If yes, the V4 plateau was entirely the LR schedule, not a capacity limit. If no (it plateaus at the same ~0.476), the 0.5B model has a real ceiling.
3. Which task has the lowest per-task reward at step 600? This determines the V4.2 priority. Whichever task scores lowest is either (a) reward function too coarse, (b) 0.5B capacity insufficient, or (c) underrepresented in training.
4. What is eval/mean_reward at step 600? Target: ≥ 0.55. If reached, scale to 3.7B. If stuck at ~0.48, proceed to V4.2.
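Questions 1 and 2 can be answered from the W&B public API instead of eyeballing dashboards; the entity/project/run IDs below are placeholders.

import wandb

api = wandb.Api()
v4 = api.run("ENTITY/PROJECT/V4_RUN_ID")        # placeholder run path
v41 = api.run("ENTITY/PROJECT/V4_1_RUN_ID")     # placeholder run path

keys = ["eval/mean_reward", "eval/extraction"]
h4 = v4.history(keys=keys)
h41 = v41.history(keys=keys)

print("V4   best eval/mean_reward:", h4["eval/mean_reward"].max())
print("V4.1 best eval/mean_reward:", h41["eval/mean_reward"].max())
print(h41.tail(10))                             # did it keep improving past step 130?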
V4.2: Conditional on V4.1 Results
V4.2 depends entirely on what V4.1 tells you. Three scenarios:
Scenario A: V4.1 reaches ≥ 0.55 eval reward
Conclusion: 0.5B-Instruct is maximized. The recipe is validated.
V4.2 = Scale to 3.7B-Instruct. Apply the V4.1 config with these adjustments:
MODEL_ID = "Polygl0t/Tucano2-qwen-3.7B-Instruct"
NUM_GENERATIONS = 8 # VRAM constraint at 3.7B (was 16)
MAX_COMPLETION_LENGTH = 1024 # 3.7B can afford richer output (was 512)
LEARNING_RATE = 2e-6 # Larger model = smaller LR (was 5e-6)
MAX_STEPS = 400 # Fewer steps needed, larger model learns faster
MAX_SEQ_LENGTH = 2048 # Unchanged
Same reward functions, same schedule, same everything else. The qualitative findings transfer; the numerical hyperparameters don't, so use the values above.
Scenario B: V4.1 plateaus at ~0.476 (same as V4), eval/extraction still ≈ 0.17
Conclusion: The parser fix didn't help; the model can't produce valid JSON at 0.5B.
V4.2 = Switch base model.
Use Polygl0t/Tucano2-qwen-0.5B-Base (no APO, no SFT) with a minimal SFT warm-up first:
# Step 1: 1-epoch SFT on extraction pairs only (~590 pairs)
# 30 minutes on L4, teaches JSON output format
MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
# SFT with loss on assistant turns only, 1 epoch
# Step 2: Same V4.1 GRPO config on top of SFT checkpoint
# If extraction still fails after the SFT warm-up → 0.5B genuinely can't do structured output
# → skip 0.5B entirely, go directly to 3.7B-Instruct
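A sketch of Step 1 in TRL. This follows the long-standing SFTTrainer + DataCollatorForCompletionOnlyLM pattern for assistant-only loss; exact argument names move between TRL versions, the data path is a guess, and the response_template assumes a Qwen-style chat template, so verify all three against the repo before running.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig, DataCollatorForCompletionOnlyLM

MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# ~590 extraction pairs, chat-formatted; the file name is an assumption.
ds = load_dataset("json", data_files="data/pairs/train_extraction.jsonl", split="train")

collator = DataCollatorForCompletionOnlyLM(
    response_template="<|im_start|>assistant",   # assumes Qwen-style chat markers
    tokenizer=tokenizer,
)
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="runs/sft_warmup", num_train_epochs=1, max_seq_length=2048),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
# Step 2: launch the V4.1 GRPO config with runs/sft_warmup as the starting checkpoint.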
Scenario C: V4.1 eval improves but one task type lags far behind others
Conclusion: Task imbalance or reward function ceiling on the lagging task.
V4.2 = Targeted fix for the weakest task. Identify it from eval/extraction,
eval/sql_qa, eval/insights, eval/push breakdown, then apply ONE of:
If extraction lags (reward < 0.25 at step 600):
Reward function ceiling. Add semantic field-value scoring:
# In reward_extraction: add bonus for correct field VALUES, not just field PRESENCE
if data.get("sentiment") in VALID_SENTIMENTS: score += 0.05 # was already there
# Add: exact-match bonus against reference answer if available
# Add: partial credit for correct complaint_category even if sentiment wrong
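A sketch of what the field-value bonus could look like inside reward_extraction. VALID_SENTIMENTS and the reference-answer plumbing are assumed to already exist; the field names follow the extraction schema mentioned in this doc, and the bonus weights are placeholders to tune.

def _field_value_bonus(data: dict, reference: dict | None) -> float:
    # Extra credit for correct field VALUES, graded per field so one wrong field
    # (e.g. sentiment) no longer wipes out credit for the others.
    bonus = 0.0
    if not reference:
        return bonus
    if data.get("sentiment") == reference.get("sentiment"):
        bonus += 0.10
    if data.get("complaint_category") == reference.get("complaint_category"):
        bonus += 0.10
    pred, ref = data.get("sentiment_score"), reference.get("sentiment_score")
    if isinstance(pred, (int, float)) and isinstance(ref, (int, float)):
        # Partial credit that decays with the numeric distance from the reference score.
        bonus += max(0.0, 0.10 - 0.05 * abs(float(pred) - float(ref)))
    return bonus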
If sql_qa lags (reward < 0.20 at step 600):
Capacity limit at 0.5B for analytical tasks. Accept this and plan 3.7B for sql_qa.
Keep 0.5B for extraction+push only, route sql_qa+insights to 3.7B in production.
If insights lags (reward < 0.20 at step 600):
Most likely data starvation (only ~148 insights prompts in training set). Generate 200 more
insights pairs from commerce.db before V4.2:
# Quick synthetic expansion using existing generate_pairs.py logic
# Target: 350 total insights pairs (was 148)
# Source: random sample from orders_enriched WHERE sentiment='negative'
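A sketch of the sampling side, assuming commerce.db is SQLite and orders_enriched has the sentiment column used above; the output path, the order_id column, and the prompt template are placeholders, so reuse the real prompt logic from generate_pairs.py.

import json
import sqlite3

conn = sqlite3.connect("commerce.db")
conn.row_factory = sqlite3.Row
rows = conn.execute(
    "SELECT * FROM orders_enriched WHERE sentiment = 'negative' "
    "ORDER BY RANDOM() LIMIT 200"
).fetchall()

with open("data/pairs/insights_extra.jsonl", "w", encoding="utf-8") as f:  # path is an assumption
    for row in rows:
        record = dict(row)
        # Placeholder prompt; swap in the real insights prompt template from generate_pairs.py.
        prompt = f"Gere um insight sobre o pedido {record.get('order_id', '?')}."
        f.write(json.dumps({"prompt": prompt, "source": record}, ensure_ascii=False) + "\n")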
Decision Tree
Run V4.1 (600 steps, 5e-6 LR, constant schedule, parser fix)
 │
 ├─ eval ≥ 0.55 at step 600?
 │    YES → Scenario A: scale to 3.7B-Instruct with the V4.1 config
 │
 ├─ eval plateaus at ~0.476 AND extraction still ≈ 0.17?
 │    YES → Scenario B: switch to 0.5B-Base with SFT warm-up
 │
 └─ eval improves but one task type consistently < 0.20?
      YES → Scenario C: targeted fix for the weakest task
             sql_qa weak     → accept, plan 3.7B
             insights weak   → generate more pairs
             extraction weak → refine reward function
V4 validated the recipe. V4.1 exhausts the 0.5B model before spending compute on 3.7B.