# V4.2 Handoff — Closing the 0.5B to Gold Standard

**Date:** 2026-04-27
**Context:** V4.1 achieved eval_best=0.645 (+35.5% over V4). The parser fix and constant LR
were the two decisive changes. SQL Q&A remains stagnant (+3.8%). The insights swing
(0.84→0.62) is eval noise, not regression. Goal: close the remaining gaps before declaring
the 0.5B scientifically complete and making the methodology case.

---

## What V4.1 Left Open (Ordered by Urgency)

| Gap | Evidence | Blocks |
|-----|----------|--------|
| Eval suite too small (n≈2 for insights/push) | Insights swings ±0.22 between consecutive evals | Every subsequent claim about per-task performance |
| SQL Q&A stagnant at +3.8% | Reward function doesn't validate SQL quality | Knowing whether this is a reward ceiling or a capacity limit |
| Only 40% of training data seen | train/epoch=0.404 at step 600 | Knowing the true performance ceiling |
| No reward function audit protocol | Parser bug persisted for 3 versions | The next parser-class bug catching you again |
| Single run, no confidence intervals | No error bars on any reported number | Credibility of the methodology case |

---
## V4.2 Changes — Exactly What to Implement

### Change 1: Expand Eval Suite to 50+ Samples (Do Before Training)

**What:** Build a static eval set with a minimum of 15 samples per task type, stratified and
held fixed across all future runs.

```python
EVAL_SAMPLES_PER_TASK = {
    "extraction": 20,
    "sql_qa": 15,
    "insights": 15,
    "push": 15,
}
# Total: 65 eval samples (was 15 mixed)
```

**How:** Sample from `data/pairs/eval.jsonl`, verify the task distribution, save as
`data/pairs/eval_v2_stratified.jsonl`. Never resample — the same 65 prompts must be
used across V4.2 seeds 42, 123, 456. A build sketch follows.
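A minimal sketch of the build step. It assumes each `eval.jsonl` record carries a `prompt`
message list and reuses the project's `_classify_task_type` helper from the sampler code in
Change 6 (both assumptions; adapt to the real schema):

```python
import json
import random

random.seed(0)  # fixed seed: the stratified set must never change between runs

records = [json.loads(line) for line in open("data/pairs/eval.jsonl")]
by_task = {t: [] for t in EVAL_SAMPLES_PER_TASK}
for r in records:
    user_txt = " ".join(m["content"] for m in r["prompt"] if m["role"] == "user")
    by_task[_classify_task_type(user_txt)].append(r)

eval_set = []
for task, n in EVAL_SAMPLES_PER_TASK.items():
    assert len(by_task[task]) >= n, f"not enough {task} samples ({len(by_task[task])} < {n})"
    eval_set.extend(random.sample(by_task[task], n))

with open("data/pairs/eval_v2_stratified.jsonl", "w") as f:
    for r in eval_set:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")
print(f"Wrote {len(eval_set)} samples")  # expect 65
```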
**Why this is Change 1:** The insights regression (0.84→0.62) made it impossible to
distinguish learning from noise. At n≈2, standard error = ±0.22. At n=15, standard
error drops to ±0.06. Every other change is uninterpretable without this fixed.

**Report format:** `mean ± 1.96 × std/sqrt(n)` for a 95% CI on each task.
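A helper matching that format (a sketch; `scores` is a task's list of per-sample rewards
from the stratified eval):

```python
import math

def ci95(scores: list[float]) -> str:
    """Format per-task eval scores as `mean ± 1.96·std/sqrt(n)` (95% CI)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)
    return f"{mean:.3f} ± {half:.3f} (n={n})"

# e.g. ci95(per_task_scores["insights"]) → something like "0.780 ± 0.060 (n=15)"
```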
---

### Change 2: Reward Function Audit Before Training (30-Minute Protocol)

**What:** Before launching any training cell, generate 20 completions (5 per task),
manually score them 0-10, and compute Spearman ρ between the human scores and the reward function.

```python
# Cell 8 — Reward Audit (add between calibration and training)
from scipy.stats import spearmanr

AUDIT_PROMPTS_PER_TASK = 5
audit_human_scores = []  # filled manually by the person running the notebook
audit_auto_scores = []

# Generate completions at temp=0.1 (near-deterministic), print each one.
# The person assigns a 0-10 score and appends it to audit_human_scores.
# Then:
rho, p_value = spearmanr(audit_human_scores, audit_auto_scores)
print(f"Reward function calibration: ρ={rho:.2f} (p={p_value:.3f})")
assert rho > 0.7, f"Reward function miscalibrated (ρ={rho:.2f} < 0.70). Fix before training."
```

**Gate:** ρ > 0.70. If below, find the discrepancy (another parser bug, wrong field
weighting, wrong task classifier) before spending GPU hours.
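The generate-and-score loop itself could look like the sketch below, using plain
`transformers` generation. It assumes `model`, `tokenizer`, and an `audit_prompts` list of
chat-formatted prompts already exist in the notebook, and that `commerce_reward_fn` takes a
list of completions and returns a list of scores (TRL-style); adapt to the real signature.
Spearman is rank-based, so the 0-10 human scale needn't match the reward scale:

```python
import torch

for prompt in audit_prompts:  # 20 prompts, 5 per task
    inputs = tokenizer.apply_chat_template(
        prompt, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
    completion = tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
    print("=" * 60)
    print(completion)
    audit_auto_scores.append(commerce_reward_fn([completion])[0])  # automatic score
    audit_human_scores.append(float(input("Human score 0-10: ")))  # manual score
```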
**Why:** The V1-V4 parser bug would have been caught in 30 minutes with this protocol.
This is the single cheapest addition to the methodology.

---
### Change 3: SQL Reward Overhaul

**What:** Replace heuristic vocabulary matching with a validation-aware reward that
distinguishes "mentions SQL keywords" from "produces a correct analytical answer."

```python
import re

def reward_sql_qa_v2(completion: str) -> float:
    answer = strip_think(completion)  # project helper: drops any <think>...</think> block
    if not answer.strip():
        return 0.0

    score = 0.0

    # Tier 1 (0.30): SQL structure detected
    sql_keywords = ["SELECT", "FROM", "WHERE", "GROUP BY", "ORDER BY",
                    "JOIN", "HAVING", "COUNT", "AVG", "SUM"]
    sql_found = sum(1 for kw in sql_keywords if kw in answer.upper())
    if sql_found >= 3:
        score += 0.30
    elif sql_found >= 1:
        score += 0.15

    # Tier 2 (0.25): answer has both a query and an explanation
    has_query = bool(re.search(r"```sql|SELECT.{5,}FROM", answer, re.IGNORECASE | re.DOTALL))
    has_answer = any(w in answer.lower() for w in ["resultado", "total", "média", "mostra", "portanto"])
    if has_query and has_answer:
        score += 0.25
    elif has_query or has_answer:
        score += 0.12

    # Tier 3 (0.25): numerical specificity
    numbers = re.findall(r"\d+(?:[.,]\d+)?(?:\s*%)?", answer)
    score += min(0.25, 0.05 * len(numbers))

    # Tier 4 (0.20): Portuguese business-domain coherence
    pt_domain = ["pedidos", "clientes", "vendedores", "produtos", "avaliação",
                 "entrega", "reclamação", "satisfação", "categoria", "período"]
    score += min(0.20, 0.04 * sum(1 for w in pt_domain if w in answer.lower()))

    return min(score, 1.0)
```

**Why:** The current SQL reward scores a completion that says "os pedidos de clientes em
2017 totalizaram 15.000" the same as one that says "SELECT COUNT(*) FROM orders WHERE
year=2017 → result: 15,000 pedidos". The old reward can't tell the difference, so the model
found the easier path (domain vocabulary + numbers) and stopped improving.

**Expected outcome:** Two possibilities after the training run:
- SQL reward improves +20%+ → the reward was the ceiling, and the model has capacity
- SQL reward still stagnates → confirmed 0.5B capacity limit for analytical reasoning

Both are valid scientific conclusions. The point is to distinguish them.

---
### Change 4: Extended Training (1,500 Steps)

**What:** `MAX_STEPS = 600 → 1500`, a 2.5× step budget over V4.1, with dataset shuffling.

```python
MAX_STEPS = 1500  # was 600
SAVE_STEPS = 100  # was 50
EVAL_STEPS = 50   # was 20; spaced to keep eval overhead manageable over 2.5x the steps
```

**Why:** V4.1 saw only 40% of the training data, and eval was still improving near the end
(best eval at step 500/600). There is signal remaining in the data. At V4.1's rate
(train/epoch=0.404 at step 600, i.e. one epoch ≈ 1,485 steps), 1,500 steps covers the full
dataset roughly once, so the realistic goal is full coverage rather than repeated passes.

**Risk monitoring:** Watch `eval/mean_reward` vs `train/reward` divergence. If eval
plateaus while train keeps rising past step 800, the model is overfitting. Stop training
and save the pre-overfitting checkpoint.

**Shuffling:** Ensure `shuffle=True` in the DataLoader, or use a different seed per
epoch for `Dataset.shuffle()`. Seeing the same prompts in the same order each epoch
eliminates the diversity benefit. A sketch follows.
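A minimal sketch of per-pass shuffling with 🤗 `datasets`, should the run extend past one
full pass (assumes `train_ds` is the prepared `Dataset` and `CURRENT_SEED` comes from Cell 3):

```python
from datasets import concatenate_datasets

# Two differently shuffled passes back to back; the trainer then consumes them
# in order, so no two passes present the prompts in the same sequence.
train_ds_2pass = concatenate_datasets([
    train_ds.shuffle(seed=CURRENT_SEED),
    train_ds.shuffle(seed=CURRENT_SEED + 1),
])
```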
**Estimated cost:** ~12 hours on L4. Run overnight.

---

### Change 5: GDPO Per-Reward Normalization

**What:** Normalize each reward component independently before summing, instead of summing
raw components and normalizing the batch total.

```python
import torch

# GDPO: normalize each reward component separately before summing.
# Instead of: reward = sum(extraction + sql + insights + push) → normalize batch,
# do: normalize each component independently, then sum.

def gdpo_normalize(component_rewards: dict[str, list[float]]) -> list[float]:
    """Per-component normalization before aggregation (GDPO 2601.05242 §3.1)."""
    normalized = {}
    for task, rewards in component_rewards.items():
        rewards_t = torch.tensor(rewards, dtype=torch.float32)
        std = rewards_t.std()
        if std > 1e-8:
            normalized[task] = ((rewards_t - rewards_t.mean()) / std).tolist()
        else:
            normalized[task] = [0.0] * len(rewards)  # zero-variance group carries no signal
    # Sum the normalized components per sample
    n = len(next(iter(normalized.values())))
    return [sum(normalized[t][i] for t in normalized) for i in range(n)]
```

This requires `commerce_reward_fn` to return per-component rewards (not just the sum), and
the trainer to call `gdpo_normalize` before computing advantages. Moderate effort: it
requires a custom trainer subclass.
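What "return per-component rewards" could look like, as a sketch. The per-task reward
function names other than `reward_sql_qa_v2` are assumed Cell 7 names; the decomposition
follows the `sum(extraction + sql + insights + push)` aggregation described above, where
every completion is scored by every component:

```python
REWARD_FNS = {
    "extraction": reward_extraction,  # assumed Cell 7 names
    "sql_qa": reward_sql_qa_v2,
    "insights": reward_insights,
    "push": reward_push,
}

def commerce_reward_components(completions: list[str]) -> dict[str, list[float]]:
    """Score every completion with every component reward, keyed the way
    gdpo_normalize expects (equal-length lists per component)."""
    return {task: [fn(c) for c in completions] for task, fn in REWARD_FNS.items()}
```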
---

### Change 6: Dynamic Task Weighting (MT-GRPO IWU)

**What:** Track per-task reward improvement rates every N eval steps and increase
sampling probability for stagnating tasks.

```python
import random

class DynamicTaskSampler:
    """
    MT-GRPO §3.2: Improvement-aware Weight Update (IWU).
    Upweights tasks with stagnating reward, downweights converging tasks.
    """
    def __init__(self, tasks, initial_weights=None, update_interval=50):
        self.tasks = tasks
        # Defaults to uniform; pass per-task weights (e.g. sql_qa=0.40 to mirror
        # the 40/40/10/10 data mix) so the starting point matches the data.
        self.weights = dict(initial_weights) if initial_weights else {t: 1.0 / len(tasks) for t in tasks}
        self.history = {t: [] for t in tasks}
        self.interval = update_interval

    def update(self, step: int, per_task_rewards: dict):
        if step % self.interval != 0 or step == 0:
            return
        for task, reward in per_task_rewards.items():
            self.history[task].append(reward)
            if len(self.history[task]) >= 2:
                improvement = self.history[task][-1] - self.history[task][-2]
                if improvement < 0.01:  # stagnating
                    self.weights[task] = min(0.60, self.weights[task] * 1.3)
                elif improvement > 0.05:  # improving fast
                    self.weights[task] = max(0.10, self.weights[task] * 0.85)
        # Normalize to sum to 1
        total = sum(self.weights.values())
        self.weights = {t: w / total for t, w in self.weights.items()}

    def sample_indices(self, dataset, n_samples: int) -> list[int]:
        """Sample indices with probability proportional to task weights."""
        task_indices = {t: [] for t in self.tasks}
        for i, record in enumerate(dataset):
            user_txt = " ".join(m["content"] for m in record["prompt"] if m["role"] == "user")
            task = _classify_task_type(user_txt)
            task_indices[task].append(i)

        sampled = []
        for task, weight in self.weights.items():
            n = max(1, int(n_samples * weight))
            pool = task_indices.get(task, [])
            if pool:
                sampled.extend(random.sample(pool, min(n, len(pool))))
        return sampled
```
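Instantiation matching the data mix, so `sampler/sql_weight` starts at the 0.40 referenced
below (the weight dict is illustrative):

```python
sampler = DynamicTaskSampler(
    tasks=["extraction", "sql_qa", "insights", "push"],
    initial_weights={"extraction": 0.40, "sql_qa": 0.40, "insights": 0.10, "push": 0.10},
    update_interval=50,  # matches EVAL_STEPS so updates ride on fresh eval numbers
)
```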
**Why:** SQL Q&A improved +3.8% vs insights +68% under equal weighting. MT-GRPO's
Theorem 1 proves GRPO allocates gradient budget to higher-variance tasks, so SQL gets
starved. IWU corrects this by progressively increasing SQL's sampling probability as
its improvement rate drops below the threshold.

**Log:** `wandb.log({"sampler/sql_weight": sampler.weights["sql_qa"], ...})` at every
update. You want to see the SQL weight increasing from 0.40 (initial) toward 0.60+ if the
model is still stagnating.

---
### Change 7: Three Seeds for Reproducibility

**What:** Run V4.2 with `seed ∈ {42, 123, 456}`. Report `mean ± std` across seeds
for all headline metrics.

```python
import random
import torch

# In Cell 3:
SEEDS = [42, 123, 456]
CURRENT_SEED = 42  # change per run

# In GRPOConfig and dataset prep:
seed = CURRENT_SEED
random.seed(seed)
torch.manual_seed(seed)
```

**Why:** Three seeds is the minimum for a credible ML result. A single lucky or unlucky
random initialization can produce ±0.05 variance in eval reward. With three seeds, you
can report `0.645 ± 0.02` instead of just `0.645` — the former is publishable, the
latter is an observation.

**Cost:** 3 × ~12h = ~36h on L4. Run them in parallel if you have access to multiple
GPUs, or sequentially over a few nights.

**Report format:**

| Task | Seed 42 | Seed 123 | Seed 456 | Mean ± 95% CI |
|---|---|---|---|---|
| Extraction | ... | ... | ... | X.XX ± 0.0X |
| SQL Q&A | ... | ... | ... | X.XX ± 0.0X |
| Insights | ... | ... | ... | X.XX ± 0.0X |
| Push | ... | ... | ... | X.XX ± 0.0X |
| **Mean** | ... | ... | ... | **X.XX ± 0.0X** |
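A sketch of the Cell 16 table generation, assuming each seed's per-task eval means were
saved into a `results[seed][task]` dict by Cell 14 (the storage layout is an assumption):

```python
import statistics

TASKS = ["extraction", "sql_qa", "insights", "push"]

def seed_row(task: str, results: dict) -> str:
    """One markdown table row: per-seed values plus mean ± 95% CI."""
    vals = [results[s][task] for s in SEEDS]
    mean = statistics.mean(vals)
    # With n=3 seeds this is a wide, honest interval
    half = 1.96 * statistics.stdev(vals) / len(vals) ** 0.5
    cells = " | ".join(f"{v:.3f}" for v in vals)
    return f"| {task} | {cells} | {mean:.2f} ± {half:.2f} |"

for task in TASKS:
    print(seed_row(task, results))
```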
---

### Change 8: Best Checkpoint Saving

The V4.1 config had `save_total_limit=5` and `save_only_model=True` but no explicit
best-checkpoint logic, and `GRPOTrainer` doesn't have a native `load_best_model_at_end`
equivalent the way `Trainer` does. The correct pattern is to track the best eval reward in
`EvalRewardCallback` and save explicitly when it improves:

```python
# In EvalRewardCallback.on_step_end, after computing mean_reward:
improved = mean_reward > self.best_reward
if improved:
    self.best_reward = mean_reward
    self.best_step = state.global_step
    self.no_improve_count = 0
    # Save the best checkpoint explicitly
    best_path = ADAPTER_DIR / "best_checkpoint"
    model.save_pretrained(str(best_path))
    tokenizer.save_pretrained(str(best_path))
    print(f"  ✓ Best checkpoint saved → {best_path} (reward={mean_reward:.4f})")
```

This guarantees the adapter saved in `best_checkpoint/` is always the peak eval reward,
regardless of where training ends or when early stopping fires. Add this to the
`EvalRewardCallback` block; the final Cell 15 should load from `best_checkpoint/` rather
than the last training step.
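Loading it in Cell 15 could look like the sketch below with plain `peft` (where
`base_model` stands for the raw Instruct model before adapter attachment — a naming
assumption; under Unsloth the loading path differs):

```python
from peft import PeftModel

# Attach the peak-reward adapter rather than whatever the last step left behind
best_model = PeftModel.from_pretrained(base_model, str(ADAPTER_DIR / "best_checkpoint"))
```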
---

## V4.2 Config Summary

```python
# Changes from V4.1
MAX_STEPS = 1500        # was 600
EVAL_STEPS = 50         # was 20
SAVE_STEPS = 100        # was 50
SEEDS = [42, 123, 456]  # run separately

# Structural additions (not hyperparameters)
eval_set = "data/pairs/eval_v2_stratified.jsonl"  # 65 fixed samples (Change 1)
reward_sql_qa = reward_sql_qa_v2                  # new SQL reward (Change 3)
reward_aggregation = gdpo_normalize               # per-component normalization (Change 5)
dynamic_sampler = DynamicTaskSampler(...)         # Change 6

# Everything else UNCHANGED from V4.1 (these are validated):
NUM_GENERATIONS = 16
MAX_COMPLETION_LENGTH = 512
TEMPERATURE = 1.0
BETA = 0.0
SCALE_REWARDS = False
LEARNING_RATE = 5e-6
lr_scheduler_type = "constant_with_warmup"
warmup_ratio = 0.05
BATCH_SIZE = 2
GRAD_ACCUM = 1
LORA_R = 16
LORA_ALPHA = 32
```
## What to Observe During V4.2

### The four questions V4.2 must answer

**Q1: Does SQL reward improve with the new reward function?**
Watch the `eval/sql_qa` trajectory. If it improves +15%+ over V4.1 by step 600, the old
reward was the ceiling. If it still stagnates at ~0.547, the 0.5B model has a genuine
capacity limit on analytical reasoning. Both are valid answers — the important thing is
knowing which.

**Q2: Is the insights regression noise or forgetting?**
With n=15 insights samples, the swing becomes ±0.06 instead of ±0.22. If insights
stabilizes at 0.75-0.85 throughout extended training, it was noise. If it consistently
drops after step 800, it's catastrophic forgetting — and you need MT-GRPO's constrained
optimization to address it.

**Q3: Does extended training push eval above 0.70?**
V4.1 was still improving at step 500 with eval_best=0.645. 1,500 steps of sustained
training with constant LR and full data coverage should push this further. Target: ≥0.70
mean across all tasks.

**Q4: Are results reproducible across seeds?**
If `mean ± std` across three seeds shows std < 0.03, the result is robust. If std > 0.05,
there's significant initialization sensitivity that needs to be understood before claiming
these numbers.

### WandB metrics to watch in real time

| Metric | Expected | Stop if |
|---|---|---|
| `eval/sql_qa` | Steeper upward slope vs V4.1 | Still flat at step 200 (reward still wrong) |
| `eval/insights` | Stable 0.75-0.85, no big swings | Drops consistently below 0.65 after step 800 |
| `eval/mean_reward` | Continues improving past step 500 | Plateaus at the same 0.645 as V4.1 |
| `sampler/sql_weight` | Rising from 0.40 toward 0.55-0.60 | Stays flat (IWU not triggering) |
| `train/reward` vs `eval/mean_reward` | Rising together | Train keeps rising past step 800 while eval plateaus (overfitting) |
---

## Drawing Conclusions for the Methodology Case

V4.2 closes the 0.5B story. After the runs complete, the project demonstrates:

### What was systematically learned across versions

| Version | Failure identified | Fix | Isolated contribution |
|---|---|---|---|
| V1 | 68% extraction task imbalance → task collapse | Rebalanced to 40/40/10/10 | Established balanced multi-task training |
| V2 | DPO near-no-op (9 gradient steps, flat loss) | Replaced DPO with GRPO | Established GRPO as the correct alignment method |
| V3 | Think model APO anchor + 2,628-token `<think>` overhead | Switched to Instruct model | Isolated architectural constraint |
| V4 | Cosine LR decay to zero by step 130 | Constant LR schedule | +370 productive training steps |
| V4 | JSON parser failed on PT-BR decimals | json-repair + PT normalizer | 3.25× measured extraction improvement |
| V4.1 | Stagnant SQL (+3.8%) | New reward function + dynamic weighting | To be measured in V4.2 |
| V4.1 | Single run, no CIs | 3-seed reproducibility | Statistical credibility |

### The scientific argument

Each version was not "trying something different" — it was **testing a specific hypothesis**
with a controlled change. The key methodological strength is causal attribution: when
eval_best went from 0.476 to 0.645, the V4.1 report decomposed the sources:
- ~0.13 from the parser fix (measured at step 20, before GRPO learning)
- ~0.13 from GRPO learning (measured from step 20 to peak)

This is the scientific standard: not just "it improved" but "here is why, with evidence."

V4.2 completes this arc by either showing SQL improvement (proving the reward was the
ceiling) or confirming a capacity limit (proving 0.5B has an irreducible floor on
analytical tasks). Either outcome is a clean, documentable finding.

### The external benchmark gap

One gap remains for true gold-standard credibility: no external benchmark comparison.
The current eval suite is project-internal. For the methodology case, consider running
the V4.2 final checkpoint against a subset of Portuguese NLP benchmarks from the Tucano2
model card (BLUEX, OAB, ENEM) on the specific knowledge domains your training covered.
Even partial coverage demonstrates the model didn't regress on general Portuguese ability
while improving on domain tasks. This is the "catastrophic forgetting" test at the
benchmark level, complementing the within-run insights stability test.

---
## Notebook Structure for V4.2

```
Cell 1:  Dependencies + env vars
Cell 2:  GPU + Unsloth + TRL verification gate
Cell 3:  Config constants (CURRENT_SEED variable)
Cell 4:  Load model + critical generation_config overrides
Cell 5:  Token ID verification gate
Cell 6:  KV cache diagnostic gate
Cell 7:  Reward functions v2 (including reward_sql_qa_v2)
Cell 8:  Reward function audit (30-min protocol, ρ > 0.70 gate)
Cell 9:  Build stratified eval set (65 samples, fixed)
Cell 10: Dataset preparation + DynamicTaskSampler init
Cell 11: Smoke test (1 step, VRAM check)
Cell 12: Probe run (10 steps, clip_ratio > 0 on 3/10 gate)
Cell 13: W&B init + full training (1,500 steps)
Cell 14: Post-training validation (stratified 65-sample eval)
Cell 15: Save adapter (loaded from best_checkpoint/, per Change 8)
Cell 16: Results table generation (for reporting)
```

**Run three times, changing only `CURRENT_SEED` in Cell 3.**

---

*V4.2 is the last 0.5B run. Its purpose is not to find more improvement —
it is to know exactly what was found and why, with enough statistical rigor
to say so in writing.*