# V4.1 / V4.2 Handoff

**Date:** 2026-04-27  
**Context:** V4 completed 200 steps, eval_best=0.476 at step 130. Training plateaued due to
LR decay and data starvation (13.5% of one epoch seen). Model learned (+33% over SFT baseline).
Recipe is validated. V4.1 squeezes the 0.5B before scaling to 3.7B.

---

## V4.1: Four Changes Only

### Change 1: Fix the JSON Parser

**What:** Replace the regex-based `_extract_json()` in the reward functions with a robust
parser that handles Portuguese decimal commas and LLM formatting quirks.

**Why:** The model is writing `"sentiment_score": 4,5` (correct PT-BR format). Your parser
calls `json.loads()`, which fails, and scores the completion near-zero. The model is being
penalized for correct behavior. This is reward misspecification: the single highest-ROI fix
available at zero training cost.

**Do this before launching any training run.** Run it against `data/pairs/eval.jsonl` first
and measure how many previously-failing extractions now parse correctly.

```python
import json, re

def _normalize_pt_decimals(s: str) -> str:
    """Convert PT-BR decimals (4,5) to JSON-valid (4.5), only outside quoted strings."""
    result, in_string, escape_next = [], False, False
    i = 0
    while i < len(s):
        c = s[i]
        if escape_next:
            result.append(c); escape_next = False; i += 1; continue
        if c == '\\' and in_string:
            result.append(c); escape_next = True; i += 1; continue
        if c == '"':
            in_string = not in_string; result.append(c); i += 1; continue
        if not in_string:
            m = re.match(r'(\d+),(\d+)', s[i:])
            if m:
                result.append(m.group(1) + '.' + m.group(2))
                i += len(m.group(0)); continue
        result.append(c); i += 1
    return ''.join(result)

def _extract_json(text: str) -> dict | None:
    stripped = re.sub(r'^```(?:json)?\s*|\s*```$', '', text.strip(), flags=re.MULTILINE).strip()
    for attempt in [stripped, _normalize_pt_decimals(stripped)]:
        try:
            result = json.loads(attempt)
            if isinstance(result, dict):
                return result
        except (json.JSONDecodeError, TypeError):
            pass
        # Fall back to the outermost {...} span (first '{' through last '}')
        match = re.search(r'\{[\s\S]*\}', attempt)
        if match:
            try:
                result = json.loads(match.group())
                if isinstance(result, dict):
                    return result
            except (json.JSONDecodeError, TypeError):
                pass
    return None
```
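
To run that measurement, a minimal sketch is below. It assumes a JSONL dump of V4 completions where each record has a `completion` field, and that the old regex-based extractor is still importable as `_extract_json_old`; both names are hypothetical, so adapt them to however you saved the outputs.

```python
# Sketch: parse-rate comparison between the old and new extractors.
# "v4_completions.jsonl" and the "completion" field are assumptions
# about your dump format; _extract_json_old is the pre-fix parser.
import json

def parse_rate(path: str, extract_fn) -> float:
    total = ok = 0
    with open(path) as f:
        for line in f:
            total += 1
            if extract_fn(json.loads(line)["completion"]) is not None:
                ok += 1
    return ok / max(total, 1)

old = parse_rate("v4_completions.jsonl", _extract_json_old)
new = parse_rate("v4_completions.jsonl", _extract_json)
print(f"parse rate: {old:.1%} -> {new:.1%}")
```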

---

### Change 2: Triple the Training Steps

**What:** `MAX_STEPS = 200 → 600`

**Why:** V4 trained on 13.5% of available data. Of ~1,480 training prompts, only ~400 were
sampled. The insights (148 prompts) and push (148 prompts) tasks may have contributed fewer
than 50 training examples each, which is not enough for meaningful policy updates. 600 steps at
2 prompts/step = 1,200 unique samples, covering ~80% of the training set once.

No other changes needed. Just increase the step count.
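
For the record, the coverage arithmetic behind the target, using only the numbers stated in this section:

```python
# Coverage at the new step count (figures from this section)
seen = 600 * 2                      # steps x unique prompts per step
print(f"{seen} samples / 1480 training prompts = {seen / 1480:.0%}")
# -> 1200 samples / 1480 training prompts = 81%
```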

---

### Change 3: Replace LR Schedule

**What:** `lr_scheduler_type = "cosine" → "constant_with_warmup"`

**Why:** The V4 cosine schedule decayed to ~1.5×10⁻¹⁰ by step 200. The model spent the
last 70 steps (130→200) with an effectively zero learning rate. This is why eval plateaued
at step 130: not capacity, not reward ceiling, not APO resistance. The optimizer simply
ran out of learning rate.

```python
# In GRPOConfig:
lr_scheduler_type = "constant_with_warmup"
warmup_ratio      = 0.05   # 5% warmup (30 steps out of 600)
```
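
To see the difference concretely, both schedules can be replayed offline with `transformers.get_scheduler` against a dummy parameter; this is a sketch, with V4's 20 warmup steps derived from its 0.1 ratio over 200 steps:

```python
# Sketch: replay V4's cosine schedule vs. the V4.1 constant schedule offline.
import torch
from transformers import get_scheduler

def lr_trace(name: str, total: int, warmup: int, base_lr: float) -> list[float]:
    opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=base_lr)
    sched = get_scheduler(name, opt, num_warmup_steps=warmup, num_training_steps=total)
    trace = []
    for _ in range(total):
        trace.append(sched.get_last_lr()[0])
        opt.step()
        sched.step()
    return trace

v4   = lr_trace("cosine", 200, 20, 2e-6)                 # V4: warmup_ratio 0.1
v4_1 = lr_trace("constant_with_warmup", 600, 30, 5e-6)   # V4.1
print(f"V4 LR at step 130: {v4[130]:.2e}, at step 199: {v4[199]:.2e}")
print(f"V4.1 LR at step 130: {v4_1[130]:.2e} (holds through step 599)")
```

The step-199 printout reproduces the ~1.5×10⁻¹⁰ figure quoted above.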

---

### Change 4: Raise the Learning Rate

**What:** `LEARNING_RATE = 2e-6 → 5e-6`

**Why:** V4's `train/grad_norm = 0.065` was low. The model can tolerate stronger updates.
At 0.5B with LoRA r=16, 5e-6 is within the safe range per Dr. GRPO's Appendix G. Combined
with the constant schedule, this sustains the update magnitude throughout the 600-step run.

---

### V4.1 Config Summary

```python
MAX_STEPS              = 600       # was 200
LEARNING_RATE          = 5e-6      # was 2e-6
lr_scheduler_type      = "constant_with_warmup"   # was cosine
warmup_ratio           = 0.05      # was 0.1 (keep short since constant schedule)

# Everything else UNCHANGED from V4:
NUM_GENERATIONS        = 16
MAX_COMPLETION_LENGTH  = 512
TEMPERATURE            = 1.0
BETA                   = 0.0
SCALE_REWARDS          = False
BATCH_SIZE             = 2
GRAD_ACCUM             = 1
LORA_R                 = 16
LORA_ALPHA             = 32
```

**Also:** Add per-task reward breakdown to `EvalRewardCallback`. This is essential for V4.2
decisions; you need to know which specific tasks are gaining and which are stuck.

```python
# In EvalRewardCallback._run_eval(), track per-task:
task_rewards = {"extraction": [], "sql_qa": [], "insights": [], "push": []}
for record in subset:
    msgs     = record["prompt"]
    text     = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    user_txt = " ".join(m["content"] for m in msgs if m["role"] == "user")
    task     = _classify_task_type(user_txt)
    # ... generate response ...
    r = self.reward_fn([response], [text])[0]
    task_rewards.setdefault(task, []).append(r)  # setdefault guards against unseen labels

per_task = {t: sum(v)/len(v) for t, v in task_rewards.items() if v}
wandb.log({"eval/" + k: v for k, v in per_task.items()}, step=state.global_step)
```
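
The snippet assumes `_classify_task_type` already exists. If it does not, a keyword heuristic is enough to start; the keywords below are guesses about the Portuguese-language prompt templates and should be checked against the actual pairs data:

```python
# Hypothetical fallback classifier; keyword choices are assumptions about
# the prompt templates, so verify them against data/pairs/train.jsonl.
def _classify_task_type(user_txt: str) -> str:
    t = user_txt.lower()
    if "json" in t or "extraia" in t:      # extraction prompts ask for JSON
        return "extraction"
    if "sql" in t or "select" in t:
        return "sql_qa"
    if "push" in t or "notifica" in t:     # push-notification copywriting
        return "push"
    return "insights"
```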

---

## What to Observe During V4.1

### Must watch in real time

| Metric | Expected | Stop if |
|---|---|---|
| `eval/mean_reward` trend | Improving past step 130, continuing to ~step 400 | Plateaus before step 200 (same as V4) |
| `eval/extraction` | Jumps significantly from V4's ~0.17 baseline | Still below 0.20 after step 100 (parser fix didn't help) |
| `train/completion_length` | 100–400 tokens | Hits the 512 ceiling (raise MAX_COMPLETION_LENGTH) |
| `train/frac_reward_zero_std` | < 0.2 | Sustained > 0.5 |
| `train/grad_norm` | 0.05–0.5 | Spikes > 2.0 (LR too high) |
| `train/kl` | Any value (the KL term is inactive at β=0) | Ignore entirely |
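
The grad-norm stop rule can be automated with a small callback. This is a sketch: recent transformers versions put `grad_norm` in the Trainer's log dict, but verify the key in your own logs before relying on it.

```python
# Optional guard for the "Stop if" column: halt when grad_norm spikes.
from transformers import TrainerCallback

class StopOnGradSpike(TrainerCallback):
    def __init__(self, threshold: float = 2.0):
        self.threshold = threshold

    def on_log(self, args, state, control, logs=None, **kwargs):
        gn = (logs or {}).get("grad_norm")
        if gn is not None and gn > self.threshold:
            print(f"grad_norm {gn:.2f} > {self.threshold} at step "
                  f"{state.global_step}; stopping")
            control.should_training_stop = True
        return control
```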

### Questions to answer from the run

1. **Did the parser fix change extraction reward?**  
   Compare eval/extraction at step 10 (V4.1) vs V4's calibration baseline (~0.17). If it
   jumps to 0.35+, the parser was the bottleneck. If it stays at ~0.17, the model genuinely
   can't produce valid JSON at 0.5B.

2. **Did eval continue improving past step 130?**  
   The key question. If yes, the V4 plateau was entirely the LR schedule and not a capacity
   limit. If no (it plateaus at the same ~0.476), the 0.5B model has a real ceiling.

3. **Which task has the lowest per-task reward at step 600?**  
   This determines the V4.2 priority. Whichever task scores lowest is either (a) reward
   function too coarse, (b) 0.5B capacity insufficient, or (c) underrepresented in training.

4. **What is eval/mean_reward at step 600?**  
   Target: ≥ 0.55. If reached, scale to 3.7B. If stuck at ~0.48, proceed to V4.2.

---

## V4.2: Conditional on V4.1 Results

V4.2 depends entirely on what V4.1 tells you. Three scenarios:

---

### Scenario A: V4.1 reaches ≥ 0.55 eval reward

**Conclusion:** 0.5B-Instruct is maximized. The recipe is validated.  
**V4.2 = Scale to 3.7B-Instruct.** Apply the V4.1 config with these adjustments:

```python
MODEL_ID               = "Polygl0t/Tucano2-qwen-3.7B-Instruct"
NUM_GENERATIONS        = 8         # VRAM constraint at 3.7B (was 16)
MAX_COMPLETION_LENGTH  = 1024      # 3.7B can afford richer output (was 512)
LEARNING_RATE          = 2e-6      # Larger model = smaller LR (was 5e-6)
MAX_STEPS              = 400       # Fewer steps needed, larger model learns faster
MAX_SEQ_LENGTH         = 2048      # Unchanged
```

Same reward functions, same schedule, same everything else. The qualitative findings transfer.
Numerical hyperparameters don't; use the values above.

---

### Scenario B: V4.1 plateaus at ~0.476 (same as V4), eval/extraction still ≈ 0.17

**Conclusion:** The parser fix didn't help; the model can't produce valid JSON at 0.5B.  
**V4.2 = Switch base model.**

Use `Polygl0t/Tucano2-qwen-0.5B-Base` (no APO, no SFT) with a minimal SFT warm-up first:

```python
# Step 1: 1-epoch SFT on extraction pairs only (~590 pairs)
# 30 minutes on L4, teaches JSON output format
MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
# SFT with loss on assistant turns only, 1 epoch

# Step 2: Same V4.1 GRPO config on top of SFT checkpoint
# If extraction still fails after SFT warm-up → 0.5B genuinely can't do structured output
# → skip 0.5B entirely, go directly to 3.7B-Instruct
```
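
A sketch of the Step-1 warm-up using TRL's completion-only collator, which masks the loss to the assistant turn. The dataset path, the `task` and `messages` fields, and the Qwen-style response template are all assumptions; adjust them to the actual pairs format and your TRL version:

```python
# Sketch only: 1-epoch SFT on extraction pairs, loss on assistant turns.
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer

MODEL_ID = "Polygl0t/Tucano2-qwen-0.5B-Base"
tok = AutoTokenizer.from_pretrained(MODEL_ID)

ds = load_dataset("json", data_files="data/pairs/train.jsonl", split="train")
ds = ds.filter(lambda r: r.get("task") == "extraction")   # hypothetical task field

def to_text(batch):
    # Render prompt messages plus the reference answer into one chat string
    return [tok.apply_chat_template(m, tokenize=False) for m in batch["messages"]]

collator = DataCollatorForCompletionOnlyLM("<|im_start|>assistant", tokenizer=tok)
trainer = SFTTrainer(
    model=MODEL_ID,
    args=SFTConfig(output_dir="sft-warmup-extraction", num_train_epochs=1, packing=False),
    train_dataset=ds,
    formatting_func=to_text,
    data_collator=collator,
)
trainer.train()
```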

---

### Scenario C: V4.1 eval improves but one task type lags far behind others

**Conclusion:** Task imbalance or reward function ceiling on the lagging task.  
**V4.2 = Targeted fix for the weakest task.** Identify it from `eval/extraction`,
`eval/sql_qa`, `eval/insights`, `eval/push` breakdown, then apply ONE of:

**If extraction lags (reward < 0.25 at step 600):**  
Reward function ceiling. Add semantic field-value scoring:
```python
# In reward_extraction: score field VALUES, not just field PRESENCE
# (`ref` = parsed reference answer for this prompt, or None if unavailable)
if data.get("sentiment") in VALID_SENTIMENTS:
    score += 0.05                                   # presence bonus (already in V4)
    if ref is not None and data["sentiment"] == ref.get("sentiment"):
        score += 0.10                               # exact-match bonus against the reference
if ref is not None and data.get("complaint_category") == ref.get("complaint_category"):
    score += 0.10                                   # partial credit even when sentiment is wrong
```

**If sql_qa lags (reward < 0.20 at step 600):**  
Capacity limit at 0.5B for analytical tasks. Accept this and plan 3.7B for sql_qa.
Keep 0.5B for extraction+push only, route sql_qa+insights to 3.7B in production.

**If insights lags (reward < 0.20 at step 600):**  
Most likely data starvation (only ~148 insights prompts in training set). Generate 200 more
insights pairs from `commerce.db` before V4.2:
```bash
# Quick synthetic expansion using existing generate_pairs.py logic
# Target: 350 total insights pairs (was 148)
# Source: random sample from orders_enriched WHERE sentiment='negative'
```
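
For the sampling step, something along these lines would do; the table and column names are taken from the note above and may not match `commerce.db` exactly, and the pair-generation itself stays in the existing `generate_pairs.py` logic:

```python
# Sketch of the source query for new insights pairs.
import sqlite3

con = sqlite3.connect("commerce.db")
rows = con.execute(
    "SELECT * FROM orders_enriched WHERE sentiment = 'negative' "
    "ORDER BY RANDOM() LIMIT 200"
).fetchall()
con.close()
# Feed each row through the existing insights template in generate_pairs.py.
```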

---

## Decision Tree

```
Run V4.1 (600 steps, 5e-6 LR, constant schedule, parser fix)
        │
        ├─ eval ≥ 0.55 at step 600?
        │       YES → Scenario A: Scale to 3.7B-Instruct with V4.1 config
        │
        ├─ eval plateaus at ~0.476 AND extraction still ≈ 0.17?
        │       YES → Scenario B: Switch to 0.5B-Base with SFT warm-up
        │
        └─ eval improves but one task type consistently < 0.20?
                YES → Scenario C: Targeted fix for weakest task
                      sql_qa weak → accept, plan 3.7B
                      insights weak → generate more pairs
                      extraction weak → refine reward function
```

---

*V4 validated the recipe. V4.1 exhausts 0.5B before spending 3.7B compute.*