# V4.2 Handoff — Closing the 0.5B Gap to the Gold Standard

**Date:** 2026-04-27  
**Context:** V4.1 achieved eval_best=0.645 (+35.5% over V4). Parser fix + constant LR
were the two decisive changes. SQL Q&A remains stagnant (+3.8%). Insights swing
(0.84→0.62) is eval noise, not regression. Goal: close remaining gaps before declaring
the 0.5B scientifically complete and making the methodology case.

---

## What V4.1 Left Open (Ordered by Urgency)

| Gap | Evidence | Blocks |
|-----|----------|--------|
| Eval suite too small (n≈2 for insights/push) | Insights swings ±0.22 between consecutive evals | Every subsequent claim about per-task performance |
| SQL Q&A stagnant at +3.8% | Reward function doesn't validate SQL quality | Knowing whether this is reward ceiling or capacity limit |
| Only 40% of training data seen | train/epoch=0.404 at step 600 | Knowing true performance ceiling |
| No reward function audit protocol | Parser bug persisted 3 versions | Next parser-class bug catching you again |
| Single run, no confidence intervals | No error bars on any reported number | Credibility of the methodology case |

---

## V4.2 Changes — Exactly What to Implement

### Change 1: Expand Eval Suite to 50+ Samples (Do Before Training)

**What:** Build a static eval set with minimum 15 samples per task type, stratified and
held fixed across all future runs.

```python
EVAL_SAMPLES_PER_TASK = {
    "extraction": 20,
    "sql_qa":     15,
    "insights":   15,
    "push":       15,
}
# Total: 65 eval samples (was 15 mixed)
```

**How:** Sample from `data/pairs/eval.jsonl`, verify task distribution, save as
`data/pairs/eval_v2_stratified.jsonl`. Never resample — the same 65 prompts must be
used across V4.2 seeds 42, 123, 456.
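
The stratification step, as a sketch (assumes each record in `eval.jsonl` carries a
`task` field; if the task is only implicit in the prompt, classify first with the
project's task classifier):

```python
import json
import random
from collections import defaultdict

# Sketch of the stratified eval builder. Assumes each eval.jsonl record
# carries a "task" field naming its task type.
EVAL_SAMPLES_PER_TASK = {"extraction": 20, "sql_qa": 15, "insights": 15, "push": 15}

def build_stratified_eval(records, quotas, seed=42):
    """Group records by task, then draw a fixed random sample per task."""
    rng = random.Random(seed)              # fixed seed: same 65 prompts every run
    by_task = defaultdict(list)
    for rec in records:
        by_task[rec["task"]].append(rec)
    sampled = []
    for task, n in quotas.items():
        pool = by_task.get(task, [])
        if len(pool) < n:
            raise ValueError(f"not enough {task} samples: {len(pool)} < {n}")
        sampled.extend(rng.sample(pool, n))
    return sampled

# Write once and never resample:
# with open("data/pairs/eval_v2_stratified.jsonl", "w") as f:
#     for rec in build_stratified_eval(records, EVAL_SAMPLES_PER_TASK):
#         f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```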

**Why this is Change 1:** The insights regression (0.84→0.62) made it impossible to
distinguish learning from noise. At n≈2, standard error = ±0.22. At n=15, standard
error drops to ±0.06. Every other change is uninterpretable without this fixed.

**Report format:** `mean ± 1.96 × std/sqrt(n)` for 95% CI on each task.
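
That formula as a small helper (a sketch; the normal approximation is reasonable once
n is 15 or more):

```python
import math

def ci95(scores):
    """Return (mean, half_width) for a 95% CI: mean ± 1.96 * std / sqrt(n)."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))  # sample std
    return mean, 1.96 * std / math.sqrt(n)

# Report per task as f"{mean:.2f} ± {half:.2f}"
```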

---

### Change 2: Reward Function Audit Before Training (30-minute Protocol)

**What:** Before launching any training cell, generate 20 completions (5 per task),
manually score them 0-10, compute Spearman ρ between human scores and reward function.

```python
# Cell X — Reward Audit (add between calibration and training)
from scipy.stats import spearmanr

AUDIT_PROMPTS_PER_TASK = 5
audit_human_scores = []   # filled manually by the person running the notebook
audit_auto_scores  = []

# Generate completions at temp=0.1 (near-deterministic); print each one
# Person assigns 0-10 score, enters into audit_human_scores list
# Then:
rho, p_value = spearmanr(audit_human_scores, audit_auto_scores)
print(f"Reward function calibration: ρ={rho:.2f} (p={p_value:.3f})")
assert rho > 0.7, f"Reward function miscalibrated (ρ={rho:.2f} < 0.70). Fix before training."
```

**Gate:** ρ > 0.70. If below, find the discrepancy (another parser bug, wrong field
weighting, wrong task classifier) before spending GPU hours.

**Why:** The V1-V4 parser bug would have been caught in 30 minutes with this protocol.
This is the single cheapest addition to the methodology.

---

### Change 3: SQL Reward Overhaul

**What:** Replace heuristic vocabulary matching with a validation-aware reward that
distinguishes "mentions SQL keywords" from "produces a correct analytical answer."

```python
import re   # strip_think() is the project's existing <think>-tag stripper

def reward_sql_qa_v2(completion: str) -> float:
    answer = strip_think(completion)
    if not answer.strip():
        return 0.0

    score = 0.0

    # Tier 1 (0.30): SQL structure detected
    sql_keywords = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "JOIN", "HAVING", "COUNT", "AVG", "SUM"]
    sql_found = sum(1 for kw in sql_keywords if kw in answer.upper())
    if sql_found >= 3:
        score += 0.30
    elif sql_found >= 1:
        score += 0.15

    # Tier 2 (0.25): Answer has both query and explanation
    has_query  = bool(re.search(r"```sql|SELECT.{5,}FROM", answer, re.IGNORECASE | re.DOTALL))
    has_answer = any(w in answer.lower() for w in ["resultado", "total", "média", "mostra", "portanto"])
    if has_query and has_answer:
        score += 0.25
    elif has_query or has_answer:
        score += 0.12

    # Tier 3 (0.25): Numerical specificity
    numbers = re.findall(r"\d+(?:[.,]\d+)?(?:\s*%)?", answer)
    score += min(0.25, 0.05 * len(numbers))

    # Tier 4 (0.20): Portuguese business domain coherence
    pt_domain = ["pedidos", "clientes", "vendedores", "produtos", "avaliação",
                 "entrega", "reclamação", "satisfação", "categoria", "período"]
    score += min(0.20, 0.04 * sum(1 for w in pt_domain if w in answer.lower()))

    return min(score, 1.0)
```

**Why:** The current SQL reward scores a completion that says "os pedidos de clientes em
2017 totalizaram 15.000" the same as one that says "SELECT COUNT(*) FROM orders WHERE
year=2017 → result: 15,000 pedidos". The V2 reward can't tell the difference. The model
found the easier path (domain vocabulary + numbers) and stopped improving.

**Expected outcome:** Two possibilities after the training run:
- SQL reward improves +20%+ → reward was the ceiling, model has capacity
- SQL reward still stagnates → confirmed 0.5B capacity limit for analytical reasoning

Both are valid scientific conclusions. The point is to distinguish them.

---

### Change 4: Multi-Epoch Training (1,500 Steps)

**What:** `MAX_STEPS = 600 → 1500`, 2.5× the V4.1 training length, with dataset shuffling.

```python
MAX_STEPS        = 1500    # was 600
SAVE_STEPS       = 100     # was 50
EVAL_STEPS       = 50      # was 20; evals are less frequent but span a 2.5× longer run
```

**Why:** V4.1 saw only 40% of the training data, and the eval was still improving at
step 500 (best eval at step 500/600), so there is signal remaining in the data. At the
observed rate (train/epoch=0.404 at step 600), 1,500 steps covers the full dataset
roughly once, with shuffling so repeated prompts arrive in a fresh order.

**Risk monitoring:** Watch `eval/mean_reward` vs `train/reward` divergence. If eval
plateaus while train keeps rising past step 800, the model is overfitting. Stop training
and save the pre-overfitting checkpoint.

**Shuffling:** Ensure `shuffle=True` in the DataLoader, or use a different seed per
epoch for `Dataset.shuffle()`. Seeing the same prompts in the same order each epoch
eliminates the diversity benefit.
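
A minimal way to get a distinct, reproducible order per pass (a sketch; with a Hugging
Face `datasets.Dataset`, a per-epoch `dataset.shuffle(seed=...)` call achieves the same
thing):

```python
import random

def epoch_order(n_samples, epoch, base_seed=42):
    """Deterministic but distinct sample order for each pass over the dataset."""
    order = list(range(n_samples))
    random.Random(base_seed + epoch).shuffle(order)   # new permutation per epoch
    return order
```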

**Estimated cost:** ~12 hours on L4. Run overnight.

---

### Change 5: GDPO Per-Reward Normalization

**What:** Normalize each reward component within its generation group before summing,
instead of normalizing the already-summed reward across the batch.

```python
# GDPO: normalize each reward component separately before summing
# Instead of: reward = sum(extraction + sql + insights + push) → normalize batch
# Do:         normalize each component independently, then sum

import torch

def gdpo_normalize(component_rewards: dict[str, list[float]]) -> list[float]:
    """Per-component normalization before aggregation (GDPO 2601.05242 §3.1)."""
    normalized = {}
    for task, rewards in component_rewards.items():
        rewards_t = torch.tensor(rewards, dtype=torch.float32)
        std = rewards_t.std()
        if std > 1e-8:
            normalized[task] = ((rewards_t - rewards_t.mean()) / std).tolist()
        else:
            normalized[task] = [0.0] * len(rewards)  # zero-variance group
    # Sum normalized components per sample
    n = len(next(iter(normalized.values())))
    return [sum(normalized[t][i] for t in normalized) for i in range(n)]
```

This requires `commerce_reward_fn` to return per-component rewards (not just the sum), and
the trainer to call `gdpo_normalize` before computing advantages. Moderate effort — requires
a custom trainer subclass.
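
To see why the per-component step matters, here is a pure-Python toy (illustrative
numbers only; `statistics` stands in for torch). With sum-then-normalize, the
high-variance `sql` component dictates every sample's advantage sign; normalizing per
component first lets the low-variance `insights` component matter too:

```python
import statistics

def znorm(xs):
    """Z-normalize a list (population std), zeroing a zero-variance group."""
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd if sd > 1e-8 else 0.0 for x in xs]

# Two reward components over a batch of 4 completions (made-up numbers):
sql      = [0.00, 0.90, 0.10, 0.80]      # high-variance component
insights = [0.70, 0.72, 0.71, 0.69]      # low-variance component

# Sum-then-normalize: sql's spread swamps the insights signal entirely
summed = znorm([a + b for a, b in zip(sql, insights)])

# Normalize-then-sum (GDPO-style): both components pull with equal strength
per_component = [a + b for a, b in zip(znorm(sql), znorm(insights))]
```

Sample 3 scores well on `sql` but poorly on `insights`: its advantage comes out positive
under sum-then-normalize and negative under per-component normalization, because its
weak insights score now actually counts.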

---

### Change 6: Dynamic Task Weighting (MT-GRPO IWU)

**What:** Track per-task reward improvement rates every N eval steps and increase
sampling probability for stagnating tasks.

```python
import random   # _classify_task_type() is the project's existing task classifier

class DynamicTaskSampler:
    """
    MT-GRPO §3.2: Improvement-aware Weight Update.
    Upweights tasks with stagnating reward, downweights converging tasks.
    """
    def __init__(self, tasks, initial_weight=0.25, update_interval=50):
        self.tasks    = tasks
        self.weights  = {t: initial_weight for t in tasks}
        self.history  = {t: [] for t in tasks}
        self.interval = update_interval

    def update(self, step: int, per_task_rewards: dict):
        if step % self.interval != 0 or step == 0:
            return
        for task, reward in per_task_rewards.items():
            self.history[task].append(reward)
            if len(self.history[task]) >= 2:
                improvement = self.history[task][-1] - self.history[task][-2]
                if improvement < 0.01:         # stagnating
                    self.weights[task] = min(0.60, self.weights[task] * 1.3)
                elif improvement > 0.05:       # improving fast
                    self.weights[task] = max(0.10, self.weights[task] * 0.85)
        # Normalize to sum to 1
        total = sum(self.weights.values())
        self.weights = {t: w / total for t, w in self.weights.items()}

    def sample_indices(self, dataset, n_samples: int) -> list[int]:
        """Sample indices with probability proportional to task weights."""
        task_indices = {t: [] for t in self.tasks}
        for i, record in enumerate(dataset):
            user_txt = " ".join(m["content"] for m in record["prompt"] if m["role"] == "user")
            task = _classify_task_type(user_txt)
            task_indices[task].append(i)

        sampled = []
        for task, weight in self.weights.items():
            n = max(1, int(n_samples * weight))
            pool = task_indices.get(task, [])
            if pool:
                sampled.extend(random.sample(pool, min(n, len(pool))))
        return sampled
```

**Why:** SQL Q&A improved +3.8% vs insights +68% under equal weighting. MT-GRPO's
Theorem 1 proves GRPO allocates gradient budget to higher-variance tasks. SQL gets
starved. IWU corrects this by progressively increasing SQL's sampling probability as
its improvement rate drops below threshold.

**Log:** `wandb.log({"sampler/sql_weight": sampler.weights["sql_qa"], ...})` at every
update. You want to see SQL's weight rising from its uniform initial value (0.25 across
four tasks) toward 0.60 if the model is still stagnating.
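
The update rule in isolation, as a toy trace (the same arithmetic as
`DynamicTaskSampler.update`, pulled out of the class so it is easy to check by hand):

```python
# Toy trace of the IWU weight update for one task whose reward has stalled.
def iwu_step(weight, prev_reward, curr_reward):
    improvement = curr_reward - prev_reward
    if improvement < 0.01:        # stagnating: upweight, capped at 0.60
        return min(0.60, weight * 1.3)
    if improvement > 0.05:        # improving fast: downweight, floored at 0.10
        return max(0.10, weight * 0.85)
    return weight

w = 0.25
for prev, curr in [(0.540, 0.545), (0.545, 0.548), (0.548, 0.550)]:
    w = iwu_step(w, prev, curr)   # three stalled evals in a row
# w grows from 0.25 to 0.25 * 1.3**3 (about 0.549, before renormalizing across tasks)
```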

---

### Change 7: Three Seeds for Reproducibility

**What:** Run V4.2 with `seed ∈ {42, 123, 456}`. Report `mean ± std` across seeds
for all headline metrics.

```python
# In Cell 3:
SEEDS = [42, 123, 456]
CURRENT_SEED = 42   # change per run

# In GRPOConfig and Dataset prep:
seed = CURRENT_SEED
random.seed(seed)
torch.manual_seed(seed)
```

**Why:** Three seeds is a practical minimum for a credible ML result. A single lucky or
unlucky initialization can swing eval reward by ±0.05. With three seeds you can report
`0.645 ± 0.02` instead of a bare `0.645`; the former is publishable, the latter is an
observation.

**Cost:** 3 × ~12h = ~36h on L4. Do these in parallel if you have access to multiple
GPUs, or sequentially overnight.

**Report format:**
```
| Task         | Seed 42 | Seed 123 | Seed 456 | Mean ± 95% CI |
|---|---|---|---|---|
| Extraction   | ...     | ...      | ...      | X.XX ± 0.0X   |
| SQL Q&A      | ...     | ...      | ...      | X.XX ± 0.0X   |
| Insights     | ...     | ...      | ...      | X.XX ± 0.0X   |
| Push         | ...     | ...      | ...      | X.XX ± 0.0X   |
| **Mean**     | ...     | ...      | ...      | **X.XX ± 0.0X** |
```

---

### Change 8: Best Checkpoint Saving

The V4.1 config had `save_total_limit=5` and `save_only_model=True` but no explicit best-checkpoint
logic. `GRPOTrainer` doesn't have a native `load_best_model_at_end` equivalent the way `Trainer` does.
The correct pattern is to track best eval reward in `EvalRewardCallback` and save explicitly when
it improves:

```python
# In EvalRewardCallback.on_step_end, after computing mean_reward:
if mean_reward > self.best_reward:   # improved
    self.best_reward = mean_reward
    self.best_step   = state.global_step
    self.no_improve_count = 0
    # Save best checkpoint explicitly
    best_path = ADAPTER_DIR / "best_checkpoint"
    model.save_pretrained(str(best_path))
    tokenizer.save_pretrained(str(best_path))
    print(f"  ✓ Best checkpoint saved → {best_path} (reward={mean_reward:.4f})")
```

This guarantees the adapter in `best_checkpoint/` always corresponds to the peak eval
reward, regardless of where training ends or when early stopping fires. Add this to the
`EvalRewardCallback` block, and make the final Cell 15 load from `best_checkpoint/`
rather than from the last training step.

---

## V4.2 Config Summary

```python
# Changes from V4.1
MAX_STEPS                  = 1500     # was 600
EVAL_STEPS                 = 50       # was 20
SAVE_STEPS                 = 100      # was 50
SEEDS                      = [42, 123, 456]  # run separately

# Structural additions (not hyperparameters)
eval_set                   = "data/pairs/eval_v2_stratified.jsonl"  # 65 fixed samples
reward_sql_qa              = reward_sql_qa_v2  # new SQL reward (Change 3)
dynamic_sampler            = DynamicTaskSampler(...)  # Change 6

# Everything else UNCHANGED from V4.1 (these are validated):
NUM_GENERATIONS            = 16
MAX_COMPLETION_LENGTH      = 512
TEMPERATURE                = 1.0
BETA                       = 0.0
SCALE_REWARDS              = False
LEARNING_RATE              = 5e-6
lr_scheduler_type          = "constant_with_warmup"
warmup_ratio               = 0.05
BATCH_SIZE                 = 2
GRAD_ACCUM                 = 1
LORA_R                     = 16
LORA_ALPHA                 = 32
```

---

## What to Observe During V4.2

### The four questions V4.2 must answer

**Q1: Does SQL reward improve with the new reward function?**  
Watch `eval/sql_qa` trajectory. If it improves +15%+ over V4.1 by step 600, the old
reward was the ceiling. If it still stagnates at ~0.547, the 0.5B model has a genuine
capacity limit on analytical reasoning. Both are valid answers — the important thing is
knowing which.

**Q2: Is the insights regression noise or forgetting?**  
With n=15 insights samples, the swing becomes ±0.06 instead of ±0.22. If insights
stabilizes at 0.75-0.85 throughout multi-epoch training, it was noise. If it consistently
drops after step 800, it's catastrophic forgetting — and you need MT-GRPO's constrained
optimization to address it.

**Q3: Does multi-epoch training push eval above 0.70?**  
V4.1 was still improving at step 500 with eval_best=0.645. 1,500 steps of sustained
training with constant LR and full data coverage should push this further. Target: ≥0.70
mean across all tasks.

**Q4: Are results reproducible across seeds?**  
If `mean ± std` across three seeds shows std < 0.03, the result is robust. If std > 0.05,
there's significant initialization sensitivity that needs to be understood before claiming
these numbers.
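
The Q4 gate as a tiny helper (the seed means in the example are hypothetical):

```python
import statistics

def seed_robust(seed_means, threshold=0.05):
    """True if the sample std across seed-level mean rewards is below the threshold."""
    return statistics.stdev(seed_means) < threshold

# e.g. seed_robust([0.645, 0.652, 0.630]) passes the gate (std is about 0.011)
```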

### WandB metrics to watch in real time

| Metric | Expected | Stop if |
|---|---|---|
| `eval/sql_qa` | Steeper upward slope vs V4.1 | Still flat at step 200 (reward still wrong) |
| `eval/insights` | Stable 0.75-0.85, no big swings | Drops consistently below 0.65 after step 800 |
| `eval/mean_reward` | Continues improving past step 500 | Plateaus at same 0.645 as V4.1 |
| `sampler/sql_weight` | Rising from 0.25 toward 0.55-0.60 | Stays flat (IWU not triggering) |
| `train/loss` vs `eval` | Both decreasing | Train drops, eval plateaus (overfitting) |

---

## Drawing Conclusions for the Methodology Case

V4.2 closes the 0.5B story. After the runs complete, the project demonstrates:

### What was systematically learned across versions

| Version | Failure identified | Fix | Isolated contribution |
|---|---|---|---|
| V1 | 68% extraction task imbalance → task collapse | Rebalanced to 40/40/10/10 | Established balanced multi-task training |
| V2 | DPO near-no-op (9 gradient steps, flat loss) | Replaced DPO with GRPO | Established GRPO as the correct alignment method |
| V3 | Think model APO anchor + 2,628-token `<think>` overhead | Switched to Instruct model | Isolated architectural constraint |
| V4 | Cosine LR decay to zero by step 130 | Constant LR schedule | +370 productive training steps |
| V4 | JSON parser failed on PT-BR decimals | json-repair + PT normalizer | 3.25× measured extraction improvement |
| V4.1 | Stagnant SQL (+3.8%) | New reward function + dynamic weighting | To be measured in V4.2 |
| V4.1 | Single run, no CIs | 3-seed reproducibility | Statistical credibility |

### The scientific argument

Each version was not "trying something different" — it was **testing a specific hypothesis**
with a controlled change. The key methodological strength is causal attribution: when
eval_best went from 0.476 to 0.645, the V4.1 report decomposed the sources:
- ~0.13 from parser fix (measured at step 20, before GRPO learning)
- ~0.13 from GRPO learning (measured from step 20 to peak)

This is the scientific standard: not just "it improved" but "here is why, with evidence."

V4.2 completes this arc by either showing SQL improvement (proving reward was the ceiling)
or confirming capacity limit (proving 0.5B has an irreducible floor on analytical tasks).
Either outcome is a clean, documentable finding.

### The external benchmark gap

One gap remains for true gold-standard credibility: no external benchmark comparison.
The current eval suite is project-internal. For the methodology case, consider running
the V4.2 final checkpoint against a subset of Portuguese NLP benchmarks from the Tucano2
model card (BLUEX, OAB, ENEM) on the specific knowledge domains your training covered.
Even partial coverage demonstrates the model didn't regress on general Portuguese ability
while improving on domain tasks. This is the "catastrophic forgetting" test at the
benchmark level, complementing the within-run insights stability test.

---

## Notebook Structure for V4.2

```
Cell  1: Dependencies + env vars
Cell  2: GPU + Unsloth + TRL verification gate
Cell  3: Config constants (CURRENT_SEED variable)
Cell  4: Load model + critical generation_config overrides
Cell  5: Token ID verification gate
Cell  6: KV cache diagnostic gate
Cell  7: Reward functions v2 (including reward_sql_qa_v2)
Cell  8: Reward function audit (30-min protocol, ρ > 0.70 gate)
Cell  9: Build stratified eval set (65 samples, fixed)
Cell 10: Dataset preparation + DynamicTaskSampler init
Cell 11: Smoke test (1 step, VRAM check)
Cell 12: Probe run (10 steps, clip_ratio > 0 on 3/10 gate)
Cell 13: W&B init + full training (1,500 steps)
Cell 14: Post-training validation (stratified 65-sample eval)
Cell 15: Save adapter
Cell 16: Results table generation (for reporting)
```

**Run three times, changing only `CURRENT_SEED` in Cell 3.**

---

*V4.2 is the last 0.5B run. Its purpose is not to find more improvement —
it is to know exactly what was found and why, with enough statistical rigor
to say so in writing.*