rtferraz committed
Commit d1385b0 · verified · 1 Parent(s): 482efc4

Create v4_2-handoff.md

(docs): V4.2 is the last 0.5B run. Its purpose is not to find more improvement — it is to know exactly what was found and why, with enough statistical rigor to say so in writing.

Files changed (1): docs/v4_2-handoff.md ADDED (+450 -0)

# V4.2 Handoff — Closing the 0.5B to Gold Standard

**Date:** 2026-04-27
**Context:** V4.1 achieved eval_best=0.645 (+35.5% over V4). Parser fix + constant LR were the two decisive changes. SQL Q&A remains stagnant (+3.8%). Insights swing (0.84→0.62) is eval noise, not regression. Goal: close remaining gaps before declaring the 0.5B scientifically complete and making the methodology case.

---

## What V4.1 Left Open (Ordered by Urgency)

| Gap | Evidence | Blocks |
|-----|----------|--------|
| Eval suite too small (n≈2 for insights/push) | Insights swings ±0.22 between consecutive evals | Every subsequent claim about per-task performance |
| SQL Q&A stagnant at +3.8% | Reward function doesn't validate SQL quality | Knowing whether this is reward ceiling or capacity limit |
| Only 40% of training data seen | train/epoch=0.404 at step 600 | Knowing true performance ceiling |
| No reward function audit protocol | Parser bug persisted 3 versions | Next parser-class bug catching you again |
| Single run, no confidence intervals | No error bars on any reported number | Credibility of the methodology case |

---

## V4.2 Changes — Exactly What to Implement

### Change 1: Expand Eval Suite to 50+ Samples (Do Before Training)

**What:** Build a static eval set with a minimum of 15 samples per task type, stratified and held fixed across all future runs.

```python
EVAL_SAMPLES_PER_TASK = {
    "extraction": 20,
    "sql_qa": 15,
    "insights": 15,
    "push": 15,
}
# Total: 65 eval samples (was 15 mixed)
```

**How:** Sample from `data/pairs/eval.jsonl`, verify the task distribution, and save the result as `data/pairs/eval_v2_stratified.jsonl`. Never resample — the same 65 prompts must be used across V4.2 seeds 42, 123, 456.
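
A minimal sketch of that step, assuming the `_classify_task_type` helper from the training notebook and JSONL records whose `prompt` field is a list of chat messages (as in the sampler code in Change 6):

```python
# Sketch: build the fixed stratified eval set once and never regenerate it.
import json
import random
from collections import defaultdict

random.seed(42)  # fixed once; the resulting file is reused verbatim by every run

with open("data/pairs/eval.jsonl") as f:
    records = [json.loads(line) for line in f]

by_task = defaultdict(list)
for rec in records:
    user_txt = " ".join(m["content"] for m in rec["prompt"] if m["role"] == "user")
    by_task[_classify_task_type(user_txt)].append(rec)  # helper assumed from the notebook

stratified = []
for task, n in EVAL_SAMPLES_PER_TASK.items():
    pool = by_task.get(task, [])
    assert len(pool) >= n, f"Not enough {task} samples ({len(pool)} < {n})"
    stratified.extend(random.sample(pool, n))

with open("data/pairs/eval_v2_stratified.jsonl", "w") as f:
    for rec in stratified:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

print({t: len(v) for t, v in by_task.items()}, "->", len(stratified), "eval samples")
```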

**Why this is Change 1:** The insights regression (0.84→0.62) made it impossible to distinguish learning from noise. At n≈2, the standard error is ±0.22; at n=15 it drops to ±0.06. Every other change is uninterpretable until this is fixed.

**Report format:** `mean ± 1.96 × std/sqrt(n)` for a 95% CI on each task.
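
As a concrete instance of that formula, a small helper along these lines (names are illustrative):

```python
# Sketch: per-task mean with a 95% CI, matching the report format above.
import statistics

def report_task_ci(per_task_scores: dict[str, list[float]]) -> None:
    for task, scores in per_task_scores.items():
        n = len(scores)
        mean = statistics.mean(scores)
        half_width = 1.96 * statistics.stdev(scores) / n ** 0.5 if n > 1 else float("nan")
        print(f"{task:12s} {mean:.3f} ± {half_width:.3f} (n={n})")

# usage: report_task_ci({"sql_qa": sql_scores, "insights": insights_scores})
```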

---

### Change 2: Reward Function Audit Before Training (30-Minute Protocol)

**What:** Before launching any training cell, generate 20 completions (5 per task), manually score them 0-10, and compute the Spearman ρ between the human scores and the reward function's scores.

```python
# Cell X — Reward Audit (add between calibration and training)
from scipy.stats import spearmanr

AUDIT_PROMPTS_PER_TASK = 5
audit_human_scores = []  # filled manually by the person running the notebook
audit_auto_scores = []   # filled with the reward function's score for each completion

# Generate completions at temp=0.1 (near-deterministic) and print each one.
# The person assigns a 0-10 score and appends it to audit_human_scores.
# Then:
rho, p_value = spearmanr(audit_human_scores, audit_auto_scores)
print(f"Reward function calibration: ρ={rho:.2f} (p={p_value:.3f})")
assert rho > 0.7, f"Reward function miscalibrated (ρ={rho:.2f} < 0.70). Fix before training."
```
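
The generation half of that cell is not shown above; a sketch of it, assuming the `model`/`tokenizer` objects from Cell 4, a hypothetical `audit_prompts` list (20 chat-formatted prompts, 5 per task), and a hypothetical `auto_reward(prompt, completion)` wrapper around the task-specific reward functions:

```python
# Sketch: produce the 20 audit completions and their automatic scores.
import torch

for i, prompt in enumerate(audit_prompts):
    input_ids = tokenizer.apply_chat_template(
        prompt, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.1)
    completion = tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)
    print(f"--- audit sample {i} ---\n{completion}\n")
    audit_auto_scores.append(auto_reward(prompt, completion))  # hypothetical wrapper
# audit_human_scores is then filled by hand, one 0-10 score per printed completion.
```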

**Gate:** ρ > 0.70. If it falls below, find the discrepancy (another parser bug, wrong field weighting, wrong task classifier) before spending GPU hours.

**Why:** The V1-V4 parser bug would have been caught in 30 minutes with this protocol. This is the single cheapest addition to the methodology.

---

### Change 3: SQL Reward Overhaul

**What:** Replace heuristic vocabulary matching with a validation-aware reward that distinguishes "mentions SQL keywords" from "produces a correct analytical answer."

```python
import re  # strip_think() comes from the existing reward-function cell

def reward_sql_qa_v2(completion: str) -> float:
    answer = strip_think(completion)
    if not answer.strip():
        return 0.0

    score = 0.0

    # Tier 1 (0.30): SQL structure detected
    sql_keywords = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY", "JOIN", "HAVING", "COUNT", "AVG", "SUM"]
    sql_found = sum(1 for kw in sql_keywords if kw in answer.upper())
    if sql_found >= 3:
        score += 0.30
    elif sql_found >= 1:
        score += 0.15

    # Tier 2 (0.25): Answer has both query and explanation
    has_query = bool(re.search(r"```sql|SELECT.{5,}FROM", answer, re.IGNORECASE | re.DOTALL))
    has_answer = any(w in answer.lower() for w in ["resultado", "total", "média", "mostra", "portanto"])
    if has_query and has_answer:
        score += 0.25
    elif has_query or has_answer:
        score += 0.12

    # Tier 3 (0.25): Numerical specificity
    numbers = re.findall(r"\d+(?:[.,]\d+)?(?:\s*%)?", answer)
    score += min(0.25, 0.05 * len(numbers))

    # Tier 4 (0.20): Portuguese business domain coherence
    pt_domain = ["pedidos", "clientes", "vendedores", "produtos", "avaliação",
                 "entrega", "reclamação", "satisfação", "categoria", "período"]
    score += min(0.20, 0.04 * sum(1 for w in pt_domain if w in answer.lower()))

    return min(score, 1.0)
```

**Why:** The current SQL reward scores a completion that says "os pedidos de clientes em 2017 totalizaram 15.000" the same as one that says "SELECT COUNT(*) FROM orders WHERE year=2017 → result: 15,000 pedidos"; it cannot tell the difference. The model found the easier path (domain vocabulary + numbers) and stopped improving.
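
A quick way to check that the v2 reward actually separates these two patterns (exact scores depend on the final implementation; the expectation is only that the second scores clearly higher):

```python
# Sketch: the new reward should separate vocabulary-only answers from query-backed ones.
vocab_only = "Os pedidos de clientes em 2017 totalizaram 15.000."
query_backed = (
    "SELECT COUNT(*) AS pedidos FROM orders WHERE year = 2017; "
    "Resultado: 15.000 pedidos de clientes em 2017."
)
print("vocab only  :", reward_sql_qa_v2(vocab_only))
print("query backed:", reward_sql_qa_v2(query_backed))
# Expectation: query_backed earns Tier 1 + Tier 2 credit and scores well above vocab_only.
```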

**Expected outcome:** Two possibilities after the training run:
- SQL reward improves by 20%+ → the reward was the ceiling, and the model has the capacity
- SQL reward still stagnates → a confirmed 0.5B capacity limit for analytical reasoning

Both are valid scientific conclusions. The point is to distinguish them.

---

### Change 4: Multi-Epoch Training (1,500 Steps)

**What:** `MAX_STEPS = 600 → 1500`, 2.5× the V4.1 step budget, with dataset shuffling between passes.

```python
MAX_STEPS = 1500   # was 600
SAVE_STEPS = 100   # was 50
EVAL_STEPS = 50    # was 20 (30 evals total, same count as V4.1's 600/20)
```

**Why:** V4.1 saw only 40% of the training data, and eval was still improving at step 500 (best eval at step 500/600). There is signal remaining in the data. 1,500 steps covers the remainder of the dataset, with shuffling between passes.

**Risk monitoring:** Watch `eval/mean_reward` vs `train/reward` divergence. If eval plateaus while train keeps rising past step 800, the model is overfitting. Stop training and save the pre-overfitting checkpoint.
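
One way to make that check concrete (a sketch; the thresholds and window are illustrative, and the reward histories are assumed to be appended by the existing eval callback):

```python
# Sketch: flag train/eval divergence, e.g. past step 800.
def is_diverging(train_rewards: list[float], eval_rewards: list[float], window: int = 4) -> bool:
    """True if train reward keeps climbing while eval reward has gone flat."""
    if len(eval_rewards) < window + 1 or len(train_rewards) < window + 1:
        return False
    train_delta = train_rewards[-1] - train_rewards[-1 - window]
    eval_delta = eval_rewards[-1] - eval_rewards[-1 - window]
    return train_delta > 0.02 and eval_delta < 0.005
```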

**Shuffling:** Ensure `shuffle=True` in the DataLoader, or use a different seed per epoch for `Dataset.shuffle()`. Seeing the same prompts in the same order each epoch eliminates the diversity benefit.
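
A minimal sketch of the per-epoch reshuffle, assuming a Hugging Face `datasets.Dataset` as in V4.1 (`Dataset.shuffle` accepts a `seed`):

```python
# Sketch: a different but reproducible order on every pass over the data.
def reshuffled(train_dataset, epoch: int, base_seed: int):
    # base_seed is CURRENT_SEED from Cell 3; the offset changes the order per pass
    return train_dataset.shuffle(seed=base_seed + epoch)
```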

**Estimated cost:** ~12 hours on L4. Run overnight.

---

### Change 5: GDPO Per-Reward Normalization

```python
# GDPO: normalize each reward component separately before summing.
# Instead of: reward = sum(extraction + sql + insights + push) → normalize batch,
# normalize each component independently, then sum.
import torch

def gdpo_normalize(component_rewards: dict[str, list[float]]) -> list[float]:
    """Per-component normalization before aggregation (GDPO 2601.05242 §3.1)."""
    normalized = {}
    for task, rewards in component_rewards.items():
        rewards_t = torch.tensor(rewards, dtype=torch.float32)
        std = rewards_t.std()
        if std > 1e-8:
            normalized[task] = ((rewards_t - rewards_t.mean()) / std).tolist()
        else:
            normalized[task] = [0.0] * len(rewards)  # zero-variance group
    # Sum normalized components per sample
    n = len(next(iter(normalized.values())))
    return [sum(normalized[t][i] for t in normalized) for i in range(n)]
```

This requires `commerce_reward_fn` to return per-component rewards (not just the sum), and the trainer to call `gdpo_normalize` before computing advantages. Moderate effort — requires a custom trainer subclass.
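
A toy call, just to show the expected input/output shape (values are illustrative):

```python
# Sketch: per-component rewards for a batch of 4 completions -> one scalar per completion.
component_rewards = {
    "extraction": [0.9, 0.2, 0.5, 0.7],
    "sql_qa":     [0.1, 0.1, 0.6, 0.3],
}
combined = gdpo_normalize(component_rewards)
print(combined)  # 4 floats; each component was z-scored within the batch before summing
```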

---

### Change 6: Dynamic Task Weighting (MT-GRPO IWU)

**What:** Track per-task reward improvement rates every N eval steps and increase the sampling probability for stagnating tasks.

```python
import random

class DynamicTaskSampler:
    """
    MT-GRPO §3.2: Improvement-aware Weight Update (IWU).
    Upweights tasks with stagnating reward, downweights converging tasks.
    """
    def __init__(self, tasks, initial_weight=0.25, update_interval=50):
        self.tasks = tasks
        self.weights = {t: initial_weight for t in tasks}
        self.history = {t: [] for t in tasks}
        self.interval = update_interval

    def update(self, step: int, per_task_rewards: dict):
        if step % self.interval != 0 or step == 0:
            return
        for task, reward in per_task_rewards.items():
            self.history[task].append(reward)
            if len(self.history[task]) >= 2:
                improvement = self.history[task][-1] - self.history[task][-2]
                if improvement < 0.01:    # stagnating
                    self.weights[task] = min(0.60, self.weights[task] * 1.3)
                elif improvement > 0.05:  # improving fast
                    self.weights[task] = max(0.10, self.weights[task] * 0.85)
        # Normalize to sum to 1
        total = sum(self.weights.values())
        self.weights = {t: w / total for t, w in self.weights.items()}

    def sample_indices(self, dataset, n_samples: int) -> list[int]:
        """Sample indices with probability proportional to task weights."""
        task_indices = {t: [] for t in self.tasks}
        for i, record in enumerate(dataset):
            user_txt = " ".join(m["content"] for m in record["prompt"] if m["role"] == "user")
            task = _classify_task_type(user_txt)
            task_indices[task].append(i)

        sampled = []
        for task, weight in self.weights.items():
            n = max(1, int(n_samples * weight))
            pool = task_indices.get(task, [])
            if pool:
                sampled.extend(random.sample(pool, min(n, len(pool))))
        return sampled
```

**Why:** SQL Q&A improved +3.8% vs insights +68% under equal weighting. MT-GRPO's Theorem 1 proves GRPO allocates gradient budget to higher-variance tasks, so SQL gets starved. IWU corrects this by progressively increasing SQL's sampling probability as its improvement rate drops below the threshold.

**Log:** `wandb.log({"sampler/sql_weight": sampler.weights["sql_qa"], ...})` at every update. You want to see the SQL weight increasing from its initial 0.25 toward the 0.60 cap if the model is still stagnating.
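
A minimal wiring sketch, assuming the eval callback already produces a per-task reward dict at each eval step and that W&B is initialised as in V4.1:

```python
# Sketch: sampler lifecycle across training, using the class defined above.
import wandb

sampler = DynamicTaskSampler(
    tasks=["extraction", "sql_qa", "insights", "push"],
    initial_weight=0.25,
    update_interval=50,
)

def log_sampler_update(step: int, per_task_rewards: dict[str, float]) -> None:
    sampler.update(step, per_task_rewards)
    wandb.log({f"sampler/{t}_weight": w for t, w in sampler.weights.items()}, step=step)

# e.g. log_sampler_update(100, {"extraction": 0.78, "sql_qa": 0.55, "insights": 0.80, "push": 0.71})
```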

---

### Change 7: Three Seeds for Reproducibility

**What:** Run V4.2 with `seed ∈ {42, 123, 456}`. Report `mean ± std` across seeds for all headline metrics.

```python
# In Cell 3:
SEEDS = [42, 123, 456]
CURRENT_SEED = 42  # change per run

# In GRPOConfig and Dataset prep:
seed = CURRENT_SEED
random.seed(seed)
torch.manual_seed(seed)
```

**Why:** Three seeds is the minimum for a credible ML result. A single lucky or unlucky random initialization can produce ±0.05 variance in eval reward. With three seeds, you can report `0.645 ± 0.02` instead of just `0.645` — the former is publishable, the latter is an observation.

**Cost:** 3 × ~12h = ~36h on L4. Do these in parallel if you have access to multiple GPUs, or sequentially overnight.

**Report format:**
```
| Task | Seed 42 | Seed 123 | Seed 456 | Mean ± 95% CI |
|---|---|---|---|---|
| Extraction | ... | ... | ... | X.XX ± 0.0X |
| SQL Q&A | ... | ... | ... | X.XX ± 0.0X |
| Insights | ... | ... | ... | X.XX ± 0.0X |
| Push | ... | ... | ... | X.XX ± 0.0X |
| **Mean** | ... | ... | ... | **X.XX ± 0.0X** |
```
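
A small helper to fill that table from three per-seed result dicts (a sketch; assumes each run ends by dumping a `{task: mean_reward}` dict):

```python
# Sketch: aggregate per-task eval means across seeds into mean ± 95% CI table rows.
import statistics

def seed_table_rows(results: dict[int, dict[str, float]]) -> None:
    tasks = list(next(iter(results.values())).keys())
    for task in tasks:
        vals = [results[seed][task] for seed in sorted(results)]
        mean = statistics.mean(vals)
        half = 1.96 * statistics.stdev(vals) / len(vals) ** 0.5
        cells = " | ".join(f"{v:.3f}" for v in vals)
        print(f"| {task} | {cells} | {mean:.2f} ± {half:.2f} |")

# usage: seed_table_rows({42: results_42, 123: results_123, 456: results_456})
```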

---

### Change 8: Best Checkpoint Saving

The V4.1 config had `save_total_limit=5` and `save_only_model=True` but no explicit best-checkpoint logic. `GRPOTrainer` doesn't have a native `load_best_model_at_end` equivalent the way `Trainer` does. The correct pattern is to track the best eval reward in `EvalRewardCallback` and save explicitly when it improves:

```python
# In EvalRewardCallback.on_step_end, after computing mean_reward:
if improved:
    self.best_reward = mean_reward
    self.best_step = state.global_step
    self.no_improve_count = 0
    # Save best checkpoint explicitly
    best_path = ADAPTER_DIR / "best_checkpoint"
    model.save_pretrained(str(best_path))
    tokenizer.save_pretrained(str(best_path))
    print(f"  ✓ Best checkpoint saved → {best_path} (reward={mean_reward:.4f})")
```

This guarantees the adapter saved in `best_checkpoint/` is always the one with peak eval reward, regardless of where training ends or when early stopping fires. Add this to the `EvalRewardCallback` block, and the final Cell 15 should load from `best_checkpoint/` rather than the last training step.
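
A sketch of that Cell 15 load, assuming the adapter is reloaded with PEFT onto the base model from Cell 4 (adjust if the notebook uses Unsloth's own save/load helpers; `base_model` is the name assumed here):

```python
# Sketch: always evaluate/ship the peak-eval adapter, not the last training step.
from peft import PeftModel
from transformers import AutoTokenizer

best_path = ADAPTER_DIR / "best_checkpoint"
model = PeftModel.from_pretrained(base_model, str(best_path))
tokenizer = AutoTokenizer.from_pretrained(str(best_path))
```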

---

## V4.2 Config Summary

```python
# Changes from V4.1
MAX_STEPS = 1500        # was 600
EVAL_STEPS = 50         # was 20
SAVE_STEPS = 100        # was 50
SEEDS = [42, 123, 456]  # run separately

# Structural additions (not hyperparameters)
eval_set = "data/pairs/eval_v2_stratified.jsonl"  # 65 fixed samples
reward_sql_qa = reward_sql_qa_v2                  # new SQL reward (Change 3)
dynamic_sampler = DynamicTaskSampler(...)         # dynamic task weighting (Change 6)

# Everything else UNCHANGED from V4.1 (these are validated):
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
LEARNING_RATE = 5e-6
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
```

---

## What to Observe During V4.2

### The four questions V4.2 must answer

**Q1: Does SQL reward improve with the new reward function?**
Watch the `eval/sql_qa` trajectory. If it improves 15%+ over V4.1 by step 600, the old reward was the ceiling. If it still stagnates at ~0.547, the 0.5B model has a genuine capacity limit on analytical reasoning. Both are valid answers — the important thing is knowing which.

**Q2: Is the insights regression noise or forgetting?**
With n=15 insights samples, the swing becomes ±0.06 instead of ±0.22. If insights stabilizes at 0.75-0.85 throughout multi-epoch training, it was noise. If it consistently drops after step 800, it's catastrophic forgetting — and you need MT-GRPO's constrained optimization to address it.

**Q3: Does multi-epoch training push eval above 0.70?**
V4.1 was still improving at step 500 with eval_best=0.645. 1,500 steps of sustained training with constant LR and full data coverage should push this further. Target: ≥0.70 mean across all tasks.

**Q4: Are results reproducible across seeds?**
If `mean ± std` across three seeds shows std < 0.03, the result is robust. If std > 0.05, there is significant initialization sensitivity that needs to be understood before claiming these numbers.

### WandB metrics to watch in real time

| Metric | Expected | Stop if |
|---|---|---|
| `eval/sql_qa` | Steeper upward slope vs V4.1 | Still flat at step 200 (reward still wrong) |
| `eval/insights` | Stable at 0.75-0.85, no big swings | Drops consistently below 0.65 after step 800 |
| `eval/mean_reward` | Continues improving past step 500 | Plateaus at the same 0.645 as V4.1 |
| `sampler/sql_weight` | Rising from 0.25 toward 0.55-0.60 | Stays flat (IWU not triggering) |
| `train/reward` vs `eval/mean_reward` | Both rising together | Train keeps rising while eval plateaus (overfitting) |

---

## Drawing Conclusions for the Methodology Case

V4.2 closes the 0.5B story. After the runs complete, the project demonstrates:

### What was systematically learned across versions

| Version | Failure identified | Fix | Isolated contribution |
|---|---|---|---|
| V1 | 68% extraction task imbalance → task collapse | Rebalanced to 40/40/10/10 | Established balanced multi-task training |
| V2 | DPO near-no-op (9 gradient steps, flat loss) | Replaced DPO with GRPO | Established GRPO as the correct alignment method |
| V3 | Think model APO anchor + 2,628-token `<think>` overhead | Switched to Instruct model | Isolated architectural constraint |
| V4 | Cosine LR decay to zero by step 130 | Constant LR schedule | +370 productive training steps |
| V4 | JSON parser failed on PT-BR decimals | json-repair + PT normalizer | 3.25× measured extraction improvement |
| V4.1 | Stagnant SQL (+3.8%) | New reward function + dynamic weighting | To be measured in V4.2 |
| V4.1 | Single run, no CIs | 3-seed reproducibility | Statistical credibility |

### The scientific argument

Each version was not "trying something different" — it was **testing a specific hypothesis** with a controlled change. The key methodological strength is causal attribution: when eval_best went from 0.476 to 0.645, the V4.1 report decomposed the sources:
- ~0.13 from the parser fix (measured at step 20, before GRPO learning)
- ~0.13 from GRPO learning (measured from step 20 to peak)

This is the scientific standard: not just "it improved" but "here is why, with evidence."

V4.2 completes this arc by either showing SQL improvement (proving reward was the ceiling) or confirming a capacity limit (proving 0.5B has an irreducible floor on analytical tasks). Either outcome is a clean, documentable finding.

### The external benchmark gap

One gap remains for true gold-standard credibility: no external benchmark comparison. The current eval suite is project-internal. For the methodology case, consider running the V4.2 final checkpoint against a subset of Portuguese NLP benchmarks from the Tucano2 model card (BLUEX, OAB, ENEM) on the specific knowledge domains your training covered. Even partial coverage demonstrates the model didn't regress on general Portuguese ability while improving on domain tasks. This is the "catastrophic forgetting" test at the benchmark level, complementing the within-run insights stability test.

---

## Notebook Structure for V4.2

```
Cell 1:  Dependencies + env vars
Cell 2:  GPU + Unsloth + TRL verification gate
Cell 3:  Config constants (CURRENT_SEED variable)
Cell 4:  Load model + critical generation_config overrides
Cell 5:  Token ID verification gate
Cell 6:  KV cache diagnostic gate
Cell 7:  Reward functions v2 (including reward_sql_qa_v2)
Cell 8:  Reward function audit (30-min protocol, ρ > 0.70 gate)
Cell 9:  Build stratified eval set (65 samples, fixed)
Cell 10: Dataset preparation + DynamicTaskSampler init
Cell 11: Smoke test (1 step, VRAM check)
Cell 12: Probe run (10 steps, clip_ratio > 0 on 3/10 gate)
Cell 13: W&B init + full training (1,500 steps)
Cell 14: Post-training validation (stratified 65-sample eval)
Cell 15: Save adapter
Cell 16: Results table generation (for reporting)
```

**Run three times, changing only `CURRENT_SEED` in Cell 3.**

---

*V4.2 is the last 0.5B run. Its purpose is not to find more improvement — it is to know exactly what was found and why, with enough statistical rigor to say so in writing.*