# V4.1 / V4.2 Handoff
**Date:** 2026-04-27
**Context:** V4 completed 200 steps with eval_best=0.476 at step 130. Training plateaued due to
LR decay and data starvation (13.5% of one epoch seen). The model did learn (+33% over the SFT
baseline), so the recipe is validated. V4.1 squeezes more out of the 0.5B before scaling to 3.7B.
---
## V4.1: Four Changes Only
### Change 1: Fix the JSON Parser
**What:** Replace the regex-based `_extract_json()` in the reward functions with a robust
parser that handles Portuguese decimal commas and LLM formatting quirks.
**Why:** The model is writing `"sentiment_score": 4,5` (correct PT-BR format). Your parser
calls `json.loads()`, which fails and scores the completion near zero. The model is being
penalized for correct behavior. This is reward misspecification, and fixing it is the single
highest-ROI change available, at zero training cost.
**Do this before launching any training run.** Run it against `data/pairs/eval.jsonl` first
and measure how many previously-failing extractions now parse correctly.
```python
import json, re


def _normalize_pt_decimals(s: str) -> str:
    """Convert PT-BR decimals (4,5) to JSON-valid (4.5), only outside quoted strings."""
    result, in_string, escape_next = [], False, False
    i = 0
    while i < len(s):
        c = s[i]
        if escape_next:
            result.append(c); escape_next = False; i += 1; continue
        if c == '\\' and in_string:
            result.append(c); escape_next = True; i += 1; continue
        if c == '"':
            in_string = not in_string; result.append(c); i += 1; continue
        if not in_string:
            m = re.match(r'(\d+),(\d+)', s[i:])
            if m:
                result.append(m.group(1) + '.' + m.group(2))
                i += len(m.group(0)); continue
        result.append(c); i += 1
    return ''.join(result)


def _extract_json(text: str) -> dict | None:
    # Strip markdown code fences before parsing.
    stripped = re.sub(r'^```(?:json)?\s*|\s*```$', '', text.strip(), flags=re.MULTILINE).strip()
    for attempt in [stripped, _normalize_pt_decimals(stripped)]:
        try:
            result = json.loads(attempt)
            if isinstance(result, dict):
                return result
        except (json.JSONDecodeError, TypeError):
            pass
        # Try finding the first {...} block
        match = re.search(r'\{[\s\S]*\}', attempt)
        if match:
            try:
                result = json.loads(match.group())
                if isinstance(result, dict):
                    return result
            except (json.JSONDecodeError, TypeError):
                pass
    return None
```
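Before wiring this in, sanity-check it. The snippet below is a minimal sketch: the sample
completions are made-up stand-ins and `_old_extract` is a crude approximation of the old
behavior; the real measurement should run over actual model completions for
`data/pairs/eval.jsonl` and count how many previously-failing extractions now parse.
```python
import json

# Hypothetical completions illustrating the failure modes described above.
samples = [
    '{"sentiment_score": 4,5, "sentiment": "negativo"}',               # PT-BR decimal comma
    '```json\n{"sentiment_score": 3.0, "sentiment": "neutro"}\n```',   # fenced output
    'Resposta: {"sentiment_score": 1,0, "sentiment": "negativo"}',     # leading prose
]

def _old_extract(text: str) -> dict | None:
    # Stand-in for the V4 behavior: bare json.loads with no normalization or fallback.
    try:
        parsed = json.loads(text)
        return parsed if isinstance(parsed, dict) else None
    except (json.JSONDecodeError, TypeError):
        return None

old_ok = sum(_old_extract(s) is not None for s in samples)
new_ok = sum(_extract_json(s) is not None for s in samples)
print(f"old parser: {old_ok}/{len(samples)} parsed   new parser: {new_ok}/{len(samples)} parsed")
```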
---
### Change 2: Triple the Training Steps
**What:** `MAX_STEPS = 200 → 600`
**Why:** V4 trained on 13.5% of available data. Of ~1,480 training prompts, only ~400 were
sampled. The insights (148 prompts) and push (148 prompts) tasks may have contributed fewer
than 50 training examples each, not enough for meaningful policy updates. 600 steps at
2 prompts/step = 1,200 unique samples, covering ~80% of the training set once.
No other changes needed. Just increase the step count.
---
### Change 3: Replace LR Schedule
**What:** `lr_scheduler_type = "cosine" → "constant_with_warmup"`
**Why:** The V4 cosine schedule decayed to ~1.5×10⁻¹⁰ by step 200. The model spent the
last 70 steps (130–200) with an effectively zero learning rate. This is why eval plateaued
at step 130: not capacity, not a reward ceiling, not APO resistance. The optimizer simply
had no step size left.
```python
# In GRPOConfig:
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05 # 5% warmup (30 steps out of 600)
```
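If you want to verify the decay numbers before launching, a throwaway script like the one
below (a sketch using transformers' scheduler helpers on a dummy optimizer, not part of the
training code) prints what each schedule yields late in the run:
```python
import torch
from transformers import get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup

def lr_at(step: int, make_scheduler, base_lr: float) -> float:
    """LR the scheduler reports after `step` scheduler steps (optimizer is a throwaway)."""
    opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=base_lr)
    sched = make_scheduler(opt)
    for _ in range(step):
        sched.step()   # PyTorch warns about stepping before optimizer.step(); harmless here
    return sched.get_last_lr()[0]

# V4: cosine over 200 steps with 10% warmup, base LR 2e-6.
v4_cosine = lambda o: get_cosine_schedule_with_warmup(o, num_warmup_steps=20, num_training_steps=200)
# V4.1: constant after a 30-step warmup, base LR 5e-6.
v41_constant = lambda o: get_constant_schedule_with_warmup(o, num_warmup_steps=30)

for step in (130, 199):
    print(f"step {step}:  V4 cosine = {lr_at(step, v4_cosine, 2e-6):.1e}   "
          f"V4.1 constant = {lr_at(step, v41_constant, 5e-6):.1e}")
```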
---
### Change 4: Raise the Learning Rate
**What:** `LEARNING_RATE = 2e-6 → 5e-6`
**Why:** V4's `train/grad_norm = 0.065` was low. The model can tolerate stronger updates.
At 0.5B with LoRA r=16, 5e-6 is within the safe range per Dr. GRPO's Appendix G. Combined
with the constant schedule, this sustains the update magnitude throughout the 600-step run.
---
### V4.1 Config Summary
```python
MAX_STEPS = 600 # was 200
LEARNING_RATE = 5e-6 # was 2e-6
lr_scheduler_type = "constant_with_warmup" # was cosine
warmup_ratio = 0.05 # was 0.1 (keep short since constant schedule)
# Everything else UNCHANGED from V4:
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
```
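For orientation, the constants above map onto TRL's `GRPOConfig` and a peft `LoraConfig`
roughly as follows. This is a sketch, not the project's training script: `output_dir` is
hypothetical, and how `BATCH_SIZE`/`GRAD_ACCUM` feed TRL's prompt batching depends on the
existing code and the pinned TRL version.
```python
from peft import LoraConfig
from trl import GRPOConfig

# Illustrative mapping only; verify argument names against the pinned TRL/peft versions.
grpo_args = GRPOConfig(
    output_dir="runs/v4_1",                      # hypothetical path
    max_steps=600,
    learning_rate=5e-6,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    num_generations=16,
    max_completion_length=512,
    temperature=1.0,
    beta=0.0,                                    # no KL penalty
    scale_rewards=False,                         # unscaled advantages (Dr. GRPO style)
    per_device_train_batch_size=2,               # assumption: BATCH_SIZE maps here
    gradient_accumulation_steps=1,
)

lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
```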
**Also:** Add per-task reward breakdown to `EvalRewardCallback`. This is essential for V4.2
decisions: you need to know which specific tasks are gaining and which are stuck.
```python
# In EvalRewardCallback._run_eval(), track per-task:
task_rewards = {"extraction": [], "sql_qa": [], "insights": [], "push": []}
for record in subset:
    msgs = record["prompt"]
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    user_txt = " ".join(m["content"] for m in msgs if m["role"] == "user")
    task = _classify_task_type(user_txt)
    # ... generate response ...
    r = self.reward_fn([response], [text])[0]
    task_rewards[task].append(r)

per_task = {t: sum(v) / len(v) for t, v in task_rewards.items() if v}
wandb.log({"eval/" + k: v for k, v in per_task.items()}, step=state.global_step)
```
---
## What to Observe During V4.1
### Must watch in real time
| Metric | Expected | Stop if |
|---|---|---|
| `eval/mean_reward` trend | Improving past step 130, continuing to ~step 400 | Plateaus before step 200 (same as V4) |
| `eval/extraction` | Jumps significantly from V4's ~0.17 baseline | Still below 0.20 after step 100 (parser fix didn't help) |
| `train/completion_length` | 100–400 tokens | Hits 512 ceiling (need to raise MAX_COMPLETION_LENGTH) |
| `train/frac_reward_zero_std` | < 0.2 | Sustained > 0.5 |
| `train/grad_norm` | 0.05β0.5 | Spikes > 2.0 (LR too high) |
| `train/kl` | Any value (the KL term is disabled at β=0) | Ignore entirely |
### Questions to answer from the run
1. **Did the parser fix change extraction reward?**
Compare eval/extraction at step 10 (V4.1) vs V4's calibration baseline (~0.17). If it
jumps to 0.35+, the parser was the bottleneck. If it stays at ~0.17, the model genuinely
can't produce valid JSON at 0.5B.
2. **Did eval continue improving past step 130?**
The key question. If yes, the V4 plateau was entirely the LR schedule and not a capacity
limit. If no (plateaus at same ~0.476), the 0.5B model has a real ceiling.
3. **Which task has the lowest per-task reward at step 600?**
This determines the V4.2 priority. Whichever task scores lowest is either (a) reward
function too coarse, (b) 0.5B capacity insufficient, or (c) underrepresented in training.
4. **What is eval/mean_reward at step 600?**
Target: ≥ 0.55. If reached, scale to 3.7B. If stuck at ~0.48, proceed to V4.2.
---
## V4.2: Conditional on V4.1 Results
V4.2 depends entirely on what V4.1 tells you. Three scenarios:
---
### Scenario A: V4.1 reaches ≥ 0.55 eval reward
**Conclusion:** 0.5B-Instruct is maximized. The recipe is validated.
**V4.2 = Scale to 3.7B-Instruct.** Apply the V4.1 config with these adjustments:
```python
MODEL_ID = "Polygl0t/Tucano2-qwen-3.7B-Instruct"
NUM_GENERATIONS = 8 # VRAM constraint at 3.7B (was 16)
MAX_COMPLETION_LENGTH = 1024 # 3.7B can afford richer output (was 512)
LEARNING_RATE = 2e-6 # Larger model = smaller LR (was 5e-6)
MAX_STEPS = 400 # Fewer steps needed, larger model learns faster
MAX_SEQ_LENGTH = 2048 # Unchanged
```
Same reward functions, same schedule, same everything else. The qualitative findings transfer.
Numerical hyperparameters don't; use the values above.
---
### Scenario B: V4.1 plateaus at ~0.476 (same as V4), eval/extraction still ≈ 0.17
**Conclusion:** Parser fix didn't help: the model can't produce valid JSON at 0.5B.
**V4.2 = Switch base model.**
Use `Polygl0t/Tucano2-qwen-0.5B-Base` (no APO, no SFT) with a minimal SFT warm-up first:
```python
# Step 1: 1-epoch SFT on extraction pairs only (~590 pairs)
# 30 minutes on L4, teaches JSON output format
MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
# SFT with loss on assistant turns only, 1 epoch
# Step 2: Same V4.1 GRPO config on top of SFT checkpoint
# If extraction still fails after SFT warm-up → 0.5B genuinely can't do structured output
# → skip 0.5B entirely, go directly to 3.7B-Instruct
```
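A minimal sketch of Step 1, assuming the warm-up uses TRL's `SFTTrainer` on chat-style
`{"messages": [...]}` records. The data path, batch size, LR, and the `assistant_only_loss`
flag are assumptions (the flag exists only in recent TRL releases); adapt to however the
extraction pairs are actually stored.
```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed file of extraction pairs in chat format; the real path/layout may differ.
pairs = load_dataset("json", data_files="data/pairs/extraction_train.jsonl", split="train")

cfg = SFTConfig(
    output_dir="runs/v4_2_sft_warmup",   # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=4,       # assumption: comfortable on an L4
    learning_rate=1e-5,                  # assumption: short, gentle warm-up
    assistant_only_loss=True,            # loss on assistant turns only (recent TRL)
)

trainer = SFTTrainer(
    model="Polygl0t/Tucano2-qwen-0.5B-Base",
    args=cfg,
    train_dataset=pairs,
)
trainer.train()
# Step 2: point the V4.1 GRPO config at the checkpoint saved under cfg.output_dir.
```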
---
### Scenario C: V4.1 eval improves but one task type lags far behind others
**Conclusion:** Task imbalance or reward function ceiling on the lagging task.
**V4.2 = Targeted fix for the weakest task.** Identify it from `eval/extraction`,
`eval/sql_qa`, `eval/insights`, `eval/push` breakdown, then apply ONE of:
**If extraction lags (reward < 0.25 at step 600):**
Reward function ceiling. Add semantic field-value scoring:
```python
# In reward_extraction: add bonus for correct field VALUES, not just field PRESENCE
if data.get("sentiment") in VALID_SENTIMENTS: score += 0.05 # was already there
# Add: exact-match bonus against reference answer if available
# Add: partial credit for correct complaint_category even if sentiment wrong
```
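As a concrete shape for that scoring, the helper below is hypothetical (the names, weights,
and reference format are not the existing `reward_extraction` code); it illustrates per-field
partial credit on values, scored independently per key, so a correct `complaint_category`
still earns credit when `sentiment` is wrong.
```python
VALID_SENTIMENTS = {"positive", "negative", "neutral"}   # assumption: actual label set lives in the repo

def field_value_bonus(data: dict, reference: dict | None) -> float:
    """Hypothetical bonus term: rewards correct field VALUES, not just presence."""
    bonus = 0.0
    if data.get("sentiment") in VALID_SENTIMENTS:
        bonus += 0.05                                    # the existing presence/validity bonus
    if reference:
        for key in ("sentiment", "sentiment_score", "complaint_category"):
            if key in reference and data.get(key) == reference[key]:
                bonus += 0.10                            # exact match against the reference answer
    return bonus
```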
**If sql_qa lags (reward < 0.20 at step 600):**
Capacity limit at 0.5B for analytical tasks. Accept this and plan 3.7B for sql_qa.
Keep 0.5B for extraction+push only, route sql_qa+insights to 3.7B in production.
**If insights lags (reward < 0.20 at step 600):**
Most likely data starvation (only ~148 insights prompts in training set). Generate 200 more
insights pairs from `commerce.db` before V4.2:
```bash
# Quick synthetic expansion using existing generate_pairs.py logic
# Target: 350 total insights pairs (was 148)
# Source: random sample from orders_enriched WHERE sentiment='negative'
```
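In Python, that expansion could look roughly like this, assuming only what the comments above
state (an `orders_enriched` table with a `sentiment` column in `commerce.db`); the actual
row-to-pair conversion should be reused from `generate_pairs.py` rather than this placeholder:
```python
import json
import sqlite3

conn = sqlite3.connect("commerce.db")
conn.row_factory = sqlite3.Row
rows = conn.execute(
    "SELECT * FROM orders_enriched WHERE sentiment = 'negative' "
    "ORDER BY RANDOM() LIMIT 202"    # 148 existing + ~200 new ≈ the 350-pair target
).fetchall()
conn.close()

with open("data/pairs/insights_extra.jsonl", "w", encoding="utf-8") as f:   # hypothetical output path
    for row in rows:
        # Placeholder record: generate_pairs.py should turn the raw row into a prompt/answer pair.
        f.write(json.dumps({"source_row": dict(row)}, ensure_ascii=False) + "\n")
```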
---
## Decision Tree
```
Run V4.1 (600 steps, 5e-6 LR, constant schedule, parser fix)
  │
  ├─ eval ≥ 0.55 at step 600?
  │     YES → Scenario A: Scale to 3.7B-Instruct with V4.1 config
  │
  ├─ eval plateaus at ~0.476 AND extraction still ≈ 0.17?
  │     YES → Scenario B: Switch to 0.5B-Base with SFT warm-up
  │
  └─ eval improves but one task type consistently < 0.20?
        YES → Scenario C: Targeted fix for weakest task
              sql_qa weak     → accept, plan 3.7B
              insights weak   → generate more pairs
              extraction weak → refine reward function
```
---
*V4 validated the recipe. V4.1 exhausts 0.5B before spending 3.7B compute.* |