rtferraz committed on
Commit cfaf49c · verified · 1 Parent(s): 521e1d8

docs: add V4 run assessment with lessons learned and improvement roadmap

Files changed (1)
  1. docs/v4_run_assessment.md +526 -0
docs/v4_run_assessment.md ADDED
@@ -0,0 +1,526 @@
1
+ # V4 Run Assessment — GRPO on Tucano2-qwen-0.5B-Instruct
2
+
3
+ **Date:** 2026-04-27
4
+ **Run:** `grpo-v4-instruct-0.5B` on Vertex AI (NVIDIA L4, 24GB)
5
+ **Duration:** 2.47 hours, 200/200 steps completed
6
+ **Reference:** `docs/ADR-002-v4-instruct.md`
7
+
8
+ ---
9
+
10
+ ## Table of Contents
11
+
12
+ 1. [Context](#1-context)
13
+ 2. [Raw Results](#2-raw-results)
14
+ 3. [The Good](#3-the-good)
15
+ 4. [The Bad](#4-the-bad)
16
+ 5. [The Ugly](#5-the-ugly)
17
+ 6. [Decisions: Why clip_ratio=0 Is Expected (Revised Understanding)](#6-decisions-why-clip_ratio0-is-expected)
18
+ 7. [Areas of Improvement at 0.5B Scale](#7-areas-of-improvement-at-05b-scale)
19
+ 8. [Summary & Next Steps](#8-summary--next-steps)
20
+ 9. [Paper References](#9-paper-references)
21
+
22
+ ---
23
+
24
+ ## 1. Context
25
+
26
+ ### 1.1 What V4 Was Designed To Prove
27
+
28
+ V4 was the validation experiment. After three GRPO runs on `Polygl0t/Tucano2-qwen-3.7B-Think` had all shown `clip_ratio=0` and no policy movement, V4 asked: **Is the problem the model, the scale, or the algorithm?**
29
+
30
+ Specifically, V4 tested:
31
+ - **Instruct vs Think model** — does eliminating `<think>` overhead fix the completion ceiling?
32
+ - **0.5B vs 3.7B** — does a smaller model produce larger per-parameter gradients?
33
+ - **G=16 vs G=4** — does more reward variance per group enable learning?
34
+ - **Hard probe gate** — can we detect failure early and not waste 25h?
35
+
36
+ ### 1.2 The Pipeline So Far
37
+
38
+ ```
39
+ V1 (3.7B-Think) → Dead: temp=0.1 default → zero signal. Killed early.
40
+ V2 (3.7B-Think) → 210 steps, +42% over SFT (0.38→0.54). But clip_ratio=0, KL=0.004.
41
+ V3 (3.7B-Think) → 171 steps. All fixes from ADR-001. clip_ratio=0, reward=0.87 (SFT baseline).
42
+ V4 (0.5B-Instruct) → 200 steps, 2.47h. eval_best=0.476 at step 130. ← THIS RUN
43
+ ```
44
+
45
+ ### 1.3 V4 Configuration
46
+
47
+ | Parameter | Value | Rationale |
48
+ |-----------|-------|-----------|
49
+ | Model | `Polygl0t/Tucano2-qwen-0.5B-Instruct` | No `<think>` overhead, 2× better NPM than Think variant |
50
+ | LoRA | r=16, α=32, standard PEFT (not Unsloth fused) | Unsloth fused kernels had bf16 dtype bug |
51
+ | G (num_generations) | 16 | 0.5B fits G=16; more rollouts = more reward variance |
52
+ | max_completion_length | 512 | No think overhead → extraction ~100, SQL ~200, insights ~300 |
53
+ | Temperature | 1.0 | Skywork-OR1 (2505.22312): τ=1.0 for max exploration |
54
+ | Learning rate | 2e-6 | Dr. GRPO Appendix G |
55
+ | β (KL penalty) | 0.0 | Dr. GRPO §3.2: unnecessary for rule-based rewards |
56
+ | scale_rewards | False | Dr. GRPO §3.1: remove std normalization bias |
57
+ | Batch size | 2 × grad_accum=1 | Effective batch = 2 prompts × 16 gen = 32 completions |
58
+ | Max steps | 200 | Validation run |
59
+ | Training data | ~1,650 prompts (90/10 split) | Full dataset with task-aware system prompts |
60
+ | generation_config overrides | temp=1.0, rep_penalty=1.0, top_k=0, top_p=1.0, use_cache=True | 6 critical overrides from ADR-002 |
61
+
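+ As a rough sketch, the table above corresponds to a TRL `GRPOConfig` along the following lines. Field names follow recent TRL releases and are assumptions about the actual training notebook, not the verbatim run config:
+
+ ```python
+ from trl import GRPOConfig
+
+ # Hedged sketch of the V4 hyperparameters from the table above.
+ config = GRPOConfig(
+     num_generations=16,              # G
+     max_completion_length=512,
+     temperature=1.0,                 # Skywork-OR1: max exploration
+     learning_rate=2e-6,              # Dr. GRPO Appendix G
+     beta=0.0,                        # no KL penalty for rule-based rewards
+     scale_rewards=False,             # Dr. GRPO: drop std normalization
+     per_device_train_batch_size=2,
+     gradient_accumulation_steps=1,
+     max_steps=200,
+ )
+
+ # The remaining ADR-002 overrides are assumed to be applied to the model's
+ # generation_config rather than here (exact mechanism lives in the notebook):
+ # model.generation_config.update(temperature=1.0, repetition_penalty=1.0,
+ #                                top_k=0, top_p=1.0, use_cache=True)
+ ```
+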
62
+ ---
63
+
64
+ ## 2. Raw Results
65
+
66
+ ### 2.1 Final W&B Metrics
67
+
68
+ ```json
69
+ {
70
+ "train/final_loss": 0.0134,
71
+ "train/total_steps": 200,
72
+ "train/duration_hours": 2.47,
73
+ "train/reward": 0.497,
74
+ "train/reward_std": 0.116,
75
+ "train/frac_reward_zero_std": 0.0,
76
+ "train/completion_length": 162.8,
77
+ "train/kl": 0.0,
78
+ "train/grad_norm": 0.065,
79
+ "train/clip_ratio/high_max": 0.0,
80
+ "train/clip_ratio/high_mean": 0.0,
81
+ "train/clip_ratio/low_mean": 0.0,
82
+ "train/clip_ratio/low_min": 0.0,
83
+ "train/clip_ratio/region_mean": 0.0,
84
+ "eval/best_reward_final": 0.476,
85
+ "eval/best_step": 130
86
+ }
87
+ ```
88
+
89
+ ### 2.2 Training Trajectory (from W&B logs)
90
+
91
+ | Phase | Steps | Eval Reward | Train Reward | Notes |
92
+ |-------|-------|-------------|--------------|-------|
93
+ | Warmup | 1-10 | 0.357 | 0.35-0.50 | LR ramping, reward oscillating |
94
+ | Learning | 10-50 | 0.357→0.416 | 0.40-0.55 | Eval improving, clear signal |
95
+ | Refinement | 50-130 | 0.416→0.476 | 0.45-0.55 | Best eval at step 130 |
96
+ | Plateau | 130-200 | 0.476 (flat) | 0.45-0.50 | LR decaying (cosine), no improvement |
97
+
98
+ ---
99
+
100
+ ## 3. The Good
101
+
102
+ ### 3.1 ✅ Instruct Model Eliminates `<think>` Problem
103
+
104
+ **V3 completion_length: 2,628 tokens (100% ceiling at 4096)**
105
+ **V4 completion_length: 163 tokens (0% ceiling at 512)**
106
+
107
+ This is the single most important architectural win. The Instruct model's chat template has no `<think>` injection — the assistant message is just `{content}`. Completions for extraction are ~100 tokens, SQL ~200, insights ~300. No token budget is wasted on reasoning preamble.
108
+
109
+ **Evidence:** V3's extraction reward was 0.12 because the model spent 2000-3000 tokens on `<think>` and never produced JSON. V4's extraction actually produces JSON output.
110
+
111
+ ### 3.2 ✅ Perfect Signal Quality: frac_reward_zero_std = 0.0
112
+
113
+ Every single batch across 200 steps had reward variance. No zero-variance groups. This is the ideal state for GRPO — every batch produces useful gradient signal.
114
+
115
+ **Context:** Most math-reasoning GRPO papers struggle with 30-99% zero-variance batches (RL-ZVP, 2509.21880). V2 had ~10% after the continuous reward fix. V4 has zero. The reward engineering is working perfectly.
116
+
117
+ ### 3.3 ✅ Eval Improvement Is Real: 0.357 → 0.476
118
+
119
+ +33% improvement on held-out data over 130 steps. This is measured on prompts the model never saw during training, using deterministic eval (temp=0.1). The learning is genuine, not just noise.
120
+
121
+ **Comparison to prior art:** Extract-0 (2509.22906) — a 7B model doing GRPO for document extraction — showed +13% improvement from GRPO (0.507 → 0.573). V4 achieved +33% on a 0.5B model. The relative gain is larger, though from a lower starting point.
122
+
123
+ ### 3.4 ✅ Speed and Efficiency
124
+
125
+ - **44 seconds/step** — ~8× faster than V3 (180-420s/step)
126
+ - **2.47 hours total** — vs 14.9h (V2) and ~25h estimated (V3)
127
+ - **Peak VRAM well within budget** — L4's 24GB with massive headroom
128
+
129
+ This means we can iterate quickly: test a hyperparameter change in ~2.5h instead of waiting a full day.
130
+
131
+ ### 3.5 ✅ Reward Function Produces Gradient Signal
132
+
133
+ `reward_std = 0.116` consistently across training. G=16 generates diverse enough completions that every group has clear winners and losers. The composite reward function (extraction + SQL + insights + push) discriminates between good and bad completions effectively.
134
+
135
+ ### 3.6 ✅ V4 Validated the Recipe
136
+
137
+ The combination of Instruct model + GRPO + task-aware prompts + continuous rewards + G=16 + 512 completion length **works**. The recipe is transferable: these qualitative findings apply to any Tucano2-Instruct model at any scale.
138
+
139
+ ---
140
+
141
+ ## 4. The Bad
142
+
143
+ ### 4.1 ⚠️ Eval Plateaued at 0.476 After Step 130
144
+
145
+ 70 steps of training (130→200) produced no eval improvement. The cosine LR schedule was decaying through this phase, so the learning rate was approaching zero. The model hit a performance ceiling.
146
+
147
+ **Possible causes:**
148
+ 1. **LR decay killed momentum** — cosine schedule reaches near-zero by step 200. The model may have had more room to learn but ran out of gradient magnitude.
149
+ 2. **0.5B capacity limit** — 490M parameters may genuinely cap out at ~0.48 eval reward for this task complexity.
150
+ 3. **Data exhaustion** — at 200 steps × 2 prompts/step = 400 prompts seen. With 1,480 training prompts, the model saw ~27% of the data once. Some tasks may be underrepresented in the batches seen.
151
+ 4. **Reward function ceiling** — the reward function's heuristics (keyword matching, length checks, JSON schema) may not discriminate finely enough at higher performance levels.
152
+
153
+ ### 4.2 ⚠️ Only 13.5% of One Epoch Completed
154
+
155
+ `train/epoch = 0.135` — the model trained on barely 13% of the available data. With 200 steps at effective batch size 2, only ~400 of ~1,480 training prompts were sampled. Some prompts were never seen, and no prompt was seen twice.
156
+
157
+ This means:
158
+ - Multi-epoch training could expose the model to the full dataset
159
+ - Task types with fewer samples (insights=114, push=222) may have been undersampled
160
+ - The model hasn't had the chance to reinforce learning on difficult examples
161
+
162
+ ### 4.3 ⚠️ Extraction Still Limited by JSON Parser
163
+
164
+ V4's extraction reward (~0.173 from earlier calibration data) beats V3's (0.12), but remains the weakest task. The root cause is the JSON parser, not the model's generation quality. Portuguese decimal comma formatting (`4,5` instead of `4.5`) causes `json.loads()` to fail, scoring valid-but-unparseable output as near-zero.
165
+
166
+ **Impact:** The model is being penalized for correct Portuguese localization behavior. A PT-BR model writing `"sentiment_score": 4,5` should receive the same reward as one writing `"sentiment_score": 4.5`.
167
+
168
+ ---
169
+
170
+ ## 5. The Ugly
171
+
172
+ ### 5.1 💀 We Initially Misdiagnosed clip_ratio=0 as Failure
173
+
174
+ For three versions (V1-V3) and the early part of V4, we interpreted `clip_ratio=0` as proof that the policy wasn't learning — that entropy collapse was the failure mode. This led to:
175
+
176
+ 1. **Designing V4 with a hard probe gate** (clip_ratio > 0 on 3/10 steps) that would have killed a healthy run
177
+ 2. **Planning algorithmic changes** (DAPO Clip-Higher, entropy bonus, DPO fallback) to "fix" a non-problem
178
+ 3. **Delaying the V4 run** while researching solutions to the wrong diagnosis
179
+ 4. **Attributing V2's +42% improvement to "SFT baseline, not learning"** — when it was in fact GRPO learning through the unclipped branch
180
+
181
+ The probe gate in Cell 12 was designed to fail on clip_ratio=0 and redirect to the fallback plan. Had we enforced it strictly, we would have abandoned a working training run.
182
+
183
+ **Lesson:** Always validate your diagnostic metrics against the ground truth (eval reward on held-out data), not against theoretical expectations derived from different settings (math reasoning on base models at full-parameter scale).
184
+
185
+ ---
186
+
187
+ ## 6. Decisions: Why clip_ratio=0 Is Expected
188
+
189
+ ### 6.1 The Original Interpretation (Wrong)
190
+
191
+ We believed `clip_ratio=0` meant "entropy collapse — the policy never moved from initialization." This was based on:
192
+ - V3's clip_ratio=0 + reward=0.87 (SFT baseline) → model producing good outputs without learning
193
+ - DAPO (2503.14476) §3.1: symmetric clipping restricts exploration tokens
194
+ - Skywork-OR1 (2505.22312): entropy collapse as the GRPO failure mode
195
+
196
+ This interpretation was correct **for V3** (where the Think model genuinely didn't learn — its reward was just the SFT baseline). But it was wrong as a universal diagnostic.
197
+
198
+ ### 6.2 The Revised Understanding (Correct)
199
+
200
+ `clip_ratio=0` means the token probability ratios `π_new(token) / π_old(token)` stay within `[1-ε, 1+ε]` = `[0.8, 1.2]`. The safety clamp never fires. **But gradients still flow through the unclipped branch on every step.** The model is learning — just gradually.
201
+
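+ A minimal sketch of what this metric measures, assuming ε=0.2 (the `[0.8, 1.2]` interval above) and using PyTorch only for illustration:
+
+ ```python
+ import torch
+
+ EPS = 0.2  # assumed clip range, matching [1 - eps, 1 + eps] = [0.8, 1.2]
+
+ def clip_fraction(logp_new: torch.Tensor, logp_old: torch.Tensor) -> float:
+     """Fraction of tokens whose probability ratio leaves the [1-eps, 1+eps] band."""
+     ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old per token
+     clipped = (ratio < 1 - EPS) | (ratio > 1 + EPS)   # where the clamp would fire
+     return clipped.float().mean().item()
+
+ # A LoRA policy that nudges already-confident tokens by a couple of points
+ # never trips the clamp, yet its gradient is still nonzero:
+ p_old = torch.tensor([0.92, 0.85, 0.97])
+ p_new = torch.tensor([0.94, 0.86, 0.97])
+ print(clip_fraction(p_new.log(), p_old.log()))        # 0.0
+ ```
+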
202
+ Four factors make this expected for our setup:
203
+
204
+ #### Factor 1: LoRA Constrains Policy Shift Magnitude
205
+
206
+ We're training 10M of 490M parameters (~2%). The logit distribution changes slowly per step because:
207
+ - LoRA updates are low-rank perturbations: `W + A×B` where A∈R^{d×16} and B∈R^{16×d}
208
+ - The LoRA output is scaled by α/r = 32/16 = 2.0, but this is still a small perturbation relative to the pretrained weights
209
+ - Token probability ratios therefore stay naturally close to 1.0
210
+
211
+ **Paper evidence:** DCPO (2509.02333) §5.2 demonstrates that once a token's old probability `q(x) ≥ 1/(1+ε) ≈ 0.83`, the new probability will participate in model updates within the interval `[1/(1+ε), 1]` **without being clipped regardless of any clipping method**. For an Instruct model generating predictable tokens (JSON field names, Portuguese articles, common verbs), the vast majority of tokens have `q(x) > 0.9`. They are never candidates for clipping.
212
+
213
+ #### Factor 2: Narrow Domain = Concentrated Token Distributions
214
+
215
+ E-commerce review analysis is a narrow distribution. The model is not being asked to discover new knowledge domains — it's refining:
216
+ - JSON field name selection from a known schema
217
+ - Sentiment vocabulary from a small set (positive, negative, neutral)
218
+ - Portuguese business vocabulary it already knows
219
+ - SQL patterns over a fixed schema
220
+
221
+ For these tasks, moving a token probability from 92% to 94% produces genuine quality improvement but never triggers the `[0.8, 1.2]` clip boundary.
222
+
223
+ **Paper evidence:** Tricks or Traps (2508.08221) §4.2.1 measured clip rates on base models and found them as low as 0.003. For LoRA fine-tuning on an already-aligned Instruct model in a narrow domain, clip_ratio = 0 is the natural regime.
224
+
225
+ #### Factor 3: Instruct + APO Anchor Stabilizes the Policy
226
+
227
+ The Tucano2-0.5B-Instruct model was already post-trained with APO (a DPO variant) for 1,115 steps with `dpo_beta=0.5`. This created a strong policy anchor. GRPO on top of this anchor produces small, precise adjustments — not radical distributional shifts.
228
+
229
+ This is fundamentally different from the math-reasoning GRPO papers where:
230
+ - Training starts from a **base model** (not instruction-tuned)
231
+ - The task requires discovering novel reasoning strategies (not refining known patterns)
232
+ - Full-parameter training (not LoRA) allows larger per-step policy shifts
233
+ - clip_ratio > 0 is expected because the base model must learn entirely new behaviors
234
+
235
+ #### Factor 4: KL=0 Is a Void Metric, Not a Failure Signal
236
+
237
+ With `β=0.0`, TRL skips reference model loading entirely. The KL divergence is not computed — it's reported as 0.0 by default. This is a reporting artifact.
238
+
239
+ DAPO (2503.14476) §2.3 explicitly removes KL penalty for rule-based rewards: "this restriction is not necessary." Ignore this metric completely in any β=0 run.
240
+
241
+ ### 6.3 The Proof: Eval Improvement
242
+
243
+ The definitive evidence that learning occurred despite clip_ratio=0:
244
+
245
+ | Step | Eval Reward | Delta |
246
+ |------|-------------|-------|
247
+ | 10 | 0.357 | baseline |
248
+ | 20 | 0.416 | +16.5% |
249
+ | 130 | 0.476 | +33.3% |
250
+
251
+ Held-out eval reward improved monotonically through step 130. The LoRA adapter absorbed reward signal through the unclipped policy gradient branch. This is consistent with Tricks or Traps (2508.08221) §3.2, which found that aligned models show "roughly 2%" improvement from RL — our +33% exceeds that, likely because we're doing domain specialization (narrow task, high reward signal quality) rather than general reasoning improvement.
252
+
253
+ ### 6.4 When clip_ratio=0 IS a Problem
254
+
255
+ To be clear: clip_ratio=0 **was** a genuine problem in V3. The difference:
256
+
257
+ | | V3 (Problem) | V4 (Not a Problem) |
258
+ |---|---|---|
259
+ | Eval reward trend | Flat at SFT baseline | +33% improvement over 130 steps |
260
+ | Completion length | 2,628 (100% ceiling) | 163 (0% ceiling) |
261
+ | Reward | 0.87 (SFT already does this) | 0.497 (room to improve) |
262
+ | Model | Think (uncontrollable `<think>`) | Instruct (clean output) |
263
+ | Underlying cause | Think model produces SFT-quality output, GRPO can't improve it | LoRA naturally constrains policy shift to within clip bounds |
264
+
265
+ The diagnostic should always be **eval reward trend**, not clip_ratio. If eval improves, the model is learning. Period.
266
+
267
+ ---
268
+
269
+ ## 7. Areas of Improvement at 0.5B Scale
270
+
271
+ Before scaling to 3.7B, we should exhaust improvement opportunities at 0.5B. The model is cheap to iterate on (2.5h/run), and lessons learned here transfer directly to the larger model. Below are concrete improvements ranked by expected impact, with paper evidence for each.
272
+
273
+ ### 7.1 🔴 Fix the JSON Parser (Immediate, Zero-Cost)
274
+
275
+ **What:** Replace the brittle regex-based `_extract_json()` with a robust parser that handles Portuguese localization and LLM output quirks.
276
+
277
+ **Why:** The current parser fails on Portuguese decimal commas (`4,5` instead of `4.5`), trailing commas, single quotes, markdown code fences, and other common LLM formatting. This means the model is penalized for generating correct-but-unparseable JSON. The extraction reward — the weakest task at ~0.17 — is artificially depressed.
278
+
279
+ **Implementation:** Use `json-repair` library with a string-aware Portuguese decimal normalizer:
280
+
281
+ ```python
+ import json
+ import re
+
+ import json_repair
+
+
+ def _normalize_pt_decimals(s: str) -> str:
+     """Convert PT-BR number formats to JSON-compatible, only outside quoted strings."""
+     result, in_string, escape_next = [], False, False
+     i = 0
+     while i < len(s):
+         c = s[i]
+         if escape_next:
+             result.append(c); escape_next = False; i += 1; continue
+         if c == '\\' and in_string:
+             result.append(c); escape_next = True; i += 1; continue
+         if c == '"':
+             in_string = not in_string; result.append(c); i += 1; continue
+         if not in_string:
+             # Rewrite "4,5" as "4.5" only when we are outside a quoted string.
+             m = re.match(r'(\d+),(\d+)', s[i:])
+             if m:
+                 result.append(m.group(1) + '.' + m.group(2))
+                 i += len(m.group(0)); continue
+         result.append(c); i += 1
+     return ''.join(result)
+
+
+ def _extract_json(text: str) -> dict | None:
+     stripped = text.strip()
+     # Drop markdown code fences before parsing.
+     stripped = re.sub(r'^```(?:json)?\s*|\s*```$', '', stripped, flags=re.MULTILINE).strip()
+     normalized = _normalize_pt_decimals(stripped)
+
+     # Strict parsing first (raw, then PT-decimal-normalized); repair only as a last resort.
+     for attempt_text in (stripped, normalized):
+         try:
+             return json.loads(attempt_text)
+         except (json.JSONDecodeError, TypeError):
+             pass
+     try:
+         result = json_repair.repair_json(normalized, return_objects=True)
+         if isinstance(result, dict):
+             return result
+     except Exception:
+         pass
+     return None
+ ```
321
+
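+ As an illustrative check (values made up), a completion that uses a Brazilian decimal comma inside a code fence should now parse:
+
+ ```python
+ sample = '```json\n{"sentiment_score": 4,5, "main_complaint": "entrega atrasada"}\n```'
+ print(_extract_json(sample))
+ # {'sentiment_score': 4.5, 'main_complaint': 'entrega atrasada'}
+ ```
+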
322
+ **Expected impact:** Extraction reward 0.17 → 0.40-0.55. The model IS producing valid JSON; the parser just can't read it. This is the highest-ROI change available — zero training compute, immediate reward improvement.
323
+
324
+ **Why it matters for GRPO:** When the parser fails, the model receives near-zero reward for correct output. This is a **reward misspecification** — the reward function punishes desired behavior. Fixing it means GRPO gets accurate gradient signal for extraction, which should improve the task disproportionately on the next run.
325
+
326
+ ### 7.2 🔴 Train for Multiple Epochs (High Impact, Zero-Cost)
327
+
328
+ **What:** Increase `MAX_STEPS` from 200 to 600-1000 to cover 2-3 full epochs, or set `NUM_EPOCHS=3` with no `MAX_STEPS` cap.
329
+
330
+ **Why:** V4 only trained on 13.5% of the available data (`train/epoch = 0.135`). Of ~1,480 training prompts, only ~400 were sampled. Some tasks (insights=114, push=222) may have been barely represented. The model hasn't had the opportunity to:
331
+ - See the full distribution of extraction JSON schemas
332
+ - Reinforce learning on hard examples
333
+ - Benefit from prompt diversity across all task types
334
+
335
+ **Paper evidence:**
336
+ - Prompt Augmentation (2602.03190) §3: demonstrates that training for longer (more data exposure) improves GRPO performance, as long as entropy collapse is managed. Their key finding is that prompt diversity (not just more steps) enables longer training.
337
+ - Extract-0 (2509.22906) §3.3: achieved their best extraction performance at step 190 of 248 — they ran nearly a full dataset pass. V4 stopped at 13% of one pass.
338
+ - "It Takes Two" (2510.00977) §3.1: reframes GRPO as N-vs-M contrastive learning. More data exposure means more contrastive pairs, which directly improves the quality of the gradient estimator.
339
+
340
+ **Risk:** The eval plateau at step 130 might persist through additional epochs. If so, the problem is reward function ceiling or capacity limit, not data exposure. **Mitigation:** Use constant LR (or cosine-with-restarts) instead of cosine decay — see 7.3.
341
+
342
+ **Expected impact:** Eval reward 0.476 → 0.52-0.55. The model sees 3× more data with the parser fix providing cleaner signal.
343
+
344
+ ### 7.3 🔴 Fix Learning Rate Schedule (High Impact, Zero-Cost)
345
+
346
+ **What:** Replace cosine decay with either constant LR or cosine-with-warm-restarts.
347
+
348
+ **Why:** The V4 cosine schedule decayed to `1.52e-10` by step 200 — effectively zero. The model spent the last 70 steps (130→200) unable to learn because the LR was vanishingly small. This almost certainly contributed to the eval plateau.
349
+
350
+ For a 200-step run, cosine decay is too aggressive. The LR reaches half its peak by step 100 and near-zero by step 150. But for multi-epoch training (600+ steps), the schedule needs to sustain gradient magnitude through additional passes.
351
+
352
+ **Options:**
353
+ 1. **Constant LR** with warmup: `warmup_ratio=0.05, lr_scheduler_type="constant_with_warmup"`. Simple, effective. The learning rate stays at peak (2e-6) for the entire run.
354
+ 2. **Cosine with restarts**: `lr_scheduler_type="cosine_with_restarts"`. The LR periodically resets, allowing the model to "re-explore" at each restart. Each restart cycle can refine on new data.
355
+ 3. **Higher peak LR**: Increase from 2e-6 to 5e-6. The grad_norm was 0.065 — the model can tolerate stronger updates. Dr. GRPO (2503.20783) Appendix G uses LR up to 5e-6 for comparable setups.
356
+
357
+ **Paper evidence:**
358
+ - Tricks or Traps (2508.08221) §4.1: found that normalization strategy matters more than LR schedule, but notes that constant LR with warmup is the most common choice in successful GRPO runs.
359
+ - DAPO (2503.14476): uses constant LR with warmup for all their runs.
360
+
361
+ **Expected impact:** Extends the learning window from step 130 to step 400+. Combined with more epochs, this could add +5-10% eval reward.
362
+
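+ A hedged sketch of how §7.2 and §7.3 translate into configuration, assuming the standard `transformers.TrainingArguments` fields that `GRPOConfig` inherits:
+
+ ```python
+ from trl import GRPOConfig
+
+ # V4.1 schedule sketch (assumed field values; check against the pinned TRL version).
+ config = GRPOConfig(
+     learning_rate=5e-6,                        # Option 3: higher peak LR
+     lr_scheduler_type="constant_with_warmup",  # Option 1: no decay to ~zero
+     warmup_ratio=0.05,
+     num_train_epochs=3,                        # §7.2: ~3 full passes over the data
+     max_steps=-1,                              # let epochs, not a step cap, end the run
+ )
+ ```
+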
363
+ ### 7.4 🟡 GDPO: Per-Reward Normalization (Moderate Impact, Moderate Effort)
364
+
365
+ **What:** Normalize each reward component (extraction, SQL, insights, push) separately before summing, then apply batch-wise normalization to the sum. This is the GDPO method from NVIDIA (2601.05242).
366
+
367
+ **Why:** The current reward function sums all components into a single scalar. GRPO then applies group-wise normalization to this sum. This causes **information loss**: distinct reward combinations map to identical advantages.
368
+
369
+ **Example from our setup:**
370
+ - Completion A: extraction=0.6, language=0.3 → sum=0.9
371
+ - Completion B: extraction=0.3, language=0.6 → sum=0.9
372
+ - After group normalization: A and B get **identical advantages** despite having completely different error profiles.
373
+
374
+ With GDPO, each component is normalized independently, preserving the fine-grained distinction between "good extraction, bad language" and "bad extraction, good language."
375
+
376
+ **Paper evidence:**
377
+ - GDPO (2601.05242) §3.1: demonstrates that decoupled normalization produces significantly more distinct advantage groups. With 4 reward components and G=16 rollouts, GDPO preserves ~4× more gradient information than standard GRPO.
378
+ - MO-GRPO (2509.22047) Theorem 1: proves that GRPO's advantage function correlates more strongly with higher-variance reward components. If `reward_insights` has higher variance than `reward_extraction`, GRPO preferentially optimizes insights at the expense of extraction — regardless of which task needs more improvement.
379
+ - GDPO (2601.05242) §3.2, Eq. 8: for tasks with asymmetric difficulty (extraction is hard, push is easy), **conditioning easy rewards on hard ones** prevents the model from gaming easy tasks. E.g., the push reward only fires if extraction reward > threshold.
380
+
381
+ **Implementation complexity:** Moderate. Requires decomposing `commerce_reward_fn` to return per-component rewards (not just a sum), then modifying the advantage computation. Since we're using TRL 0.24.0, this likely requires a custom trainer subclass or a reward function that encodes components in the scalar (e.g., returning separate reward signals).
382
+
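+ A toy sketch of the decoupled idea (not the paper's exact estimator): z-score each reward component within the group before summing, instead of normalizing the summed reward. When components have different group variances, completions that fail on different components stop receiving identical advantages:
+
+ ```python
+ import numpy as np
+
+ def summed_advantage(components: dict[str, np.ndarray]) -> np.ndarray:
+     """Vanilla GRPO-style: sum the components, then normalize the sum within the group."""
+     total = sum(components.values())
+     return (total - total.mean()) / (total.std() + 1e-6)
+
+ def decoupled_advantage(components: dict[str, np.ndarray]) -> np.ndarray:
+     """GDPO-style sketch: z-score each component within the group, then sum."""
+     return sum((r - r.mean()) / (r.std() + 1e-6) for r in components.values())
+
+ # Illustrative G=4 group: completion 0 is strong on extraction, completion 1 on language.
+ components = {
+     "extraction": np.array([0.9, 0.2, 0.2, 0.2]),
+     "language":   np.array([0.2, 0.9, 0.8, 0.9]),
+ }
+ print(summed_advantage(components))     # completions 0 and 1 tie (both sum to 1.1)
+ print(decoupled_advantage(components))  # they now separate despite equal sums
+ ```
+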
383
+ **Expected impact:** Extraction reward improvement (+10-20%), more balanced cross-task learning. The effect is strongest when task difficulties are asymmetric — which is exactly our case (extraction=0.17, insights=0.70).
384
+
385
+ ### 7.5 🟡 Increase LoRA Rank (Moderate Impact, Low Effort)
386
+
387
+ **What:** Increase LoRA rank from r=16 to r=64 or r=128, with corresponding α scaling.
388
+
389
+ **Why:** At r=16, the LoRA adapter has ~10M trainable parameters out of 490M (~2%). This constrains the model's capacity to learn new behaviors. For domain specialization across 4 distinct task types, the adapter may need more expressivity.
390
+
391
+ **Paper evidence:**
392
+ - Gazal-R1 (2506.21594) §3.2: used LoRA rank **256** with rsLoRA scaling (`α/√r` instead of `α/r`) for their SFT+GRPO pipeline on medical reasoning. They specifically cite rank-stabilized LoRA as enabling "higher learning capacity" at high ranks without gradient collapse.
393
+ - Extract-0 (2509.22906): used standard LoRA for their GRPO extraction pipeline, but on a 7B model where the base capacity is larger. At 0.5B, the LoRA rank may need to be proportionally higher to achieve comparable expressivity.
394
+
395
+ **Scaling consideration:** With rsLoRA, the scaling factor is `α/√r` instead of `α/r`. At r=64, α=128: standard LoRA gives scaling=2.0, rsLoRA gives scaling=16.0. This prevents the vanishing update problem at high ranks.
396
+
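+ A sketch of the corresponding PEFT adapter config; `use_rslora` is the PEFT flag for rank-stabilized scaling, while the dropout and `target_modules` shown here are assumptions, not the repo's actual values:
+
+ ```python
+ from peft import LoraConfig
+
+ lora_cfg = LoraConfig(
+     r=64,
+     lora_alpha=128,
+     use_rslora=True,      # scale by alpha/sqrt(r) = 16.0 instead of alpha/r = 2.0
+     lora_dropout=0.05,    # assumed
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
+     task_type="CAUSAL_LM",
+ )
+ ```
+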
397
+ **VRAM impact:** Minimal. Going from r=16 to r=64 adds ~30M parameters (10M → 40M), which is ~0.15GB at bf16. Well within L4's headroom.
398
+
399
+ **Expected impact:** More expressive adapter → better multi-task learning → +5-10% eval reward. The effect should be most visible on tasks where the SFT model is weakest (extraction, SQL), because those require the most behavioral change.
400
+
401
+ ### 7.6 🟡 Prompt Augmentation (Moderate Impact, Moderate Effort)
402
+
403
+ **What:** Apply template-based prompt augmentation — present the same underlying task through multiple prompt formats (formal JSON request, conversational question, bullet-point instruction, etc.).
404
+
405
+ **Why:** With ~1,480 prompts, the model may memorize prompt patterns rather than learning the underlying extraction/analysis skill. Augmenting prompts forces the model to generalize.
406
+
407
+ **Paper evidence:**
408
+ - Prompt Augmentation (2602.03190) §3.2-3.3: demonstrates that augmenting prompts with diverse templates enables GRPO to train for longer without entropy collapse. They use 13 templates across 4 categories (DeepSeek-style, free-form, reflection-based, chain-of-thought). Their key result: prompt augmentation allows training to continue past the point where vanilla GRPO collapses.
409
+ - Cocktail Effect (2410.01109) §4: mixing 30% general reasoning data with domain-specific data improves domain performance by 2-15%. The regularization effect prevents overfitting to domain-specific prompt patterns.
410
+
411
+ **Implementation for our case** (a template-sampling sketch follows this list):
412
+ 1. **Extraction templates:** Vary between "Retorne um JSON com os campos...", "Analise e extraia os seguintes dados em formato JSON...", "Faça a extração estruturada da seguinte avaliação..."
413
+ 2. **SQL templates:** Vary between direct question, business scenario framing, analytical report request
414
+ 3. **Mix in 30% general Portuguese instruction data** (from the Tucano2 training corpus or Portuguese Alpaca)
415
+
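+ A minimal template-sampling sketch for item 1 above; the template wordings are illustrative, not the dataset's actual prompts:
+
+ ```python
+ import random
+
+ # Hypothetical extraction templates (same task, different surface forms).
+ EXTRACTION_TEMPLATES = [
+     "Retorne um JSON com os campos {fields} para a avaliação abaixo:\n{review}",
+     "Analise e extraia os seguintes dados em formato JSON ({fields}):\n{review}",
+     "Faça a extração estruturada da avaliação a seguir, retornando {fields}:\n{review}",
+ ]
+
+ def augment_extraction_prompt(review: str, fields: str) -> str:
+     """Present the same review through a randomly chosen template."""
+     return random.choice(EXTRACTION_TEMPLATES).format(fields=fields, review=review)
+ ```
+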
416
+ **Expected impact:** +5-10% eval reward + ability to train for more epochs without overfitting.
417
+
418
+ ### 7.7 🟡 Dynamic Task Weighting (Moderate Impact, Moderate Effort)
419
+
420
+ **What:** Track per-task reward improvement rates and upweight underperforming tasks (extraction) during training.
421
+
422
+ **Why:** The current fixed sampling (40% extraction, 40% SQL, 10% insights, 10% push) doesn't adapt to learning dynamics. If extraction plateaus while insights keeps improving, the model wastes gradient budget on a task that's already converging.
423
+
424
+ **Paper evidence:**
425
+ - Multi-Task GRPO (2602.05547) §3: proposes Improvement-aware Weight Update (IWU) that monitors per-task reward trends and dynamically adjusts task sampling probabilities. Their key finding: dynamic weighting prevents the "collapse to easy task" failure mode.
426
+ - MO-GRPO (2509.22047): demonstrates that equal weighting causes GRPO to preferentially optimize higher-variance reward components, regardless of task importance.
427
+
428
+ **Implementation:** A callback that tracks per-task reward moving averages and adjusts task sampling weights every N steps. Tasks with stagnating rewards get upweighted; tasks approaching their ceiling get downweighted.
429
+
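+ A hedged sketch of the reweighting rule (not the paper's exact IWU update): make sampling weights inversely proportional to each task's recent reward improvement, so stagnating tasks get more gradient attention:
+
+ ```python
+ def update_task_weights(improvement: dict[str, float], floor: float = 0.01) -> dict[str, float]:
+     """Sampling weights inversely proportional to recent per-task reward improvement."""
+     inverse = {task: 1.0 / max(delta, floor) for task, delta in improvement.items()}
+     total = sum(inverse.values())
+     return {task: v / total for task, v in inverse.items()}
+
+ # e.g. extraction barely moving, insights improving fast (illustrative numbers):
+ print(update_task_weights({"extraction": 0.01, "sql": 0.03, "insights": 0.08, "push": 0.05}))
+ # extraction gets the largest share of the next sampling window
+ ```
+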
430
+ **Expected impact:** More balanced cross-task performance. Extraction (currently weakest) gets more gradient attention.
431
+
432
+ ### 7.8 🟢 G=32 Exploration (Low Effort, Uncertain Impact)
433
+
434
+ **What:** Increase group size from G=16 to G=32.
435
+
436
+ **Why:** More rollouts per prompt means more reward variance per group, which means higher-quality advantage estimates. At 0.5B with 512 max completion, the VRAM should support G=32.
437
+
438
+ **Paper evidence:**
439
+ - "It Takes Two" (2510.00977) §3.1: reframes GRPO as N-vs-M contrastive learning. Larger G = more contrastive pairs = better gradient quality. However, they also show that G=2 can work surprisingly well, suggesting diminishing returns.
440
+ - MC-GRPO (2601.22582): shows that smaller G can work with median baseline correction. The question is whether G=16→32 provides meaningful improvement or just costs 2× compute.
441
+
442
+ **Trade-off:** G=32 doubles generation time (~88s/step instead of 44s). A 600-step run would take ~15h instead of ~7h.
443
+
444
+ **Expected impact:** Uncertain. G=16 already provides good reward variance (std=0.116). G=32 might add marginal improvement. Try this **after** the higher-ROI changes (parser fix, multi-epoch, LR schedule).
445
+
446
+ ### 7.9 🟢 Reward Function Refinement (Low Effort, Incremental Impact)
447
+
448
+ **What:** Tighten the reward function to discriminate more finely at higher performance levels.
449
+
450
+ **Why:** The current reward functions use coarse heuristics (keyword matching, length ranges, structure markers). At eval_reward=0.476, the model may be near the ceiling of what these heuristics can distinguish. Two completions that are qualitatively different might score identically.
451
+
452
+ **Specific refinements:**
453
+ 1. **Extraction:** After parser fix, add semantic similarity scoring for `main_complaint` field (using sentence-transformers). Add bonus for correct field VALUE, not just field PRESENCE.
454
+ 2. **SQL Q&A:** Add SQL syntax validation (even without execution). Reward actual SQL keywords (`SELECT`, `WHERE`, `JOIN`, `GROUP BY`); a keyword-check sketch follows this list.
455
+ 3. **Insights:** Add Portuguese fluency scoring via perplexity check against the base model. Reward domain-specific terminology density.
456
+ 4. **Push:** Add emoji/urgency detection. Reward call-to-action phrases.
457
+
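+ A coarse sketch for the SQL keyword check in item 2; the threshold, bonus, and keyword list are assumptions, not the current `commerce_reward_fn` logic:
+
+ ```python
+ import re
+
+ SQL_KEYWORDS = ("SELECT", "FROM", "WHERE", "JOIN", "GROUP BY", "ORDER BY")
+
+ def sql_keyword_reward(completion: str) -> float:
+     """Syntax-free check: fraction of expected SQL keywords present, plus a shape bonus."""
+     text = completion.upper()
+     hits = sum(1 for kw in SQL_KEYWORDS if kw in text)
+     # Small bonus if the completion at least has a SELECT ... FROM <table> shape.
+     bonus = 0.2 if re.search(r"SELECT\s+.+?\s+FROM\s+\w+", text, re.DOTALL) else 0.0
+     return min(1.0, hits / len(SQL_KEYWORDS) + bonus)
+ ```
+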
458
+ **Expected impact:** Incremental (+2-5% per refinement), but compounds across tasks. The reward function is the product specification — making it sharper makes the product better.
459
+
460
+ ---
461
+
462
+ ## 8. Summary & Next Steps
463
+
464
+ ### 8.1 What V4 Proved
465
+
466
+ 1. **The Instruct model is the right base.** No `<think>` overhead, clean output, healthy training dynamics.
467
+ 2. **GRPO works for domain specialization.** +33% eval improvement on a 0.5B model in 2.5h.
468
+ 3. **clip_ratio=0 is normal for LoRA + narrow domain.** Not a failure signal. Eval improvement is the ground truth.
469
+ 4. **The recipe is validated.** Task-aware prompts, continuous rewards, G=16, temp=1.0, β=0.0, scale_rewards=False.
470
+
471
+ ### 8.2 Recommended V4.1 Run (Squeeze 0.5B First)
472
+
473
+ Before scaling to 3.7B, run V4.1 with these changes:
474
+
475
+ | Change | From (V4) | To (V4.1) | Section |
476
+ |--------|-----------|-----------|---------|
477
+ | JSON parser | Brittle regex | json-repair + PT decimal normalizer | §7.1 |
478
+ | Max steps | 200 | 600 (3 epochs) | §7.2 |
479
+ | LR schedule | Cosine (decays to ~0) | Constant with warmup | §7.3 |
480
+ | Peak LR | 2e-6 | 5e-6 | §7.3 |
481
+ | LoRA rank | r=16, α=32 | r=64, α=128, rsLoRA | §7.5 |
482
+
483
+ **Expected outcome:** Eval reward 0.476 → 0.55-0.60. If achieved, this validates that the 0.5B model has more to give. If still plateaus at ~0.48, the capacity limit is real and we scale to 3.7B.
484
+
485
+ ### 8.3 V4.2 Run (If V4.1 Shows Improvement)
486
+
487
+ | Change | From (V4.1) | To (V4.2) | Section |
488
+ |--------|-------------|-----------|---------|
489
+ | Reward normalization | Sum then normalize | GDPO per-component normalize | §7.4 |
490
+ | Prompt augmentation | Fixed prompts | 3-5 templates per task | §7.6 |
491
+ | Task weighting | Fixed 40/40/10/10 | Dynamic IWU | §7.7 |
492
+ | Reward refinement | Coarse heuristics | Semantic similarity + SQL validation | §7.9 |
493
+
494
+ ### 8.4 Scale to 3.7B (After 0.5B Is Maximized)
495
+
496
+ The validated recipe from V4.1/V4.2 transfers to `Polygl0t/Tucano2-qwen-3.7B-Instruct` (or Base → SFT → GRPO if Instruct variant doesn't exist). Adjustments needed:
497
+ - G: 16 → 4-8 (VRAM constraint)
498
+ - LR: 5e-6 → 1e-6 to 2e-6 (larger model = smaller LR)
499
+ - LoRA rank: may keep r=64 or increase to r=128
500
+ - Completion length: may increase to 1024 for richer insights
501
+
502
+ ---
503
+
504
+ ## 9. Paper References
505
+
506
+ | Paper | ArXiv ID | Key Finding Used | Where Applied |
507
+ |-------|----------|------------------|---------------|
508
+ | **DCPO** | 2509.02333 | Once q(x) ≥ 0.83, tokens update without clipping regardless of method | §6.2: explains why clip_ratio=0 is expected |
509
+ | **Tricks or Traps** | 2508.08221 | Base models have clip rate ≈ 0.003; local mean + global std is robust normalization | §6.2: clip_ratio=0 is normal; §7.3: LR schedule |
510
+ | **DAPO** | 2503.14476 | KL penalty unnecessary for rule-based rewards; uses constant LR | §6.4: KL=0 is void; §7.3: LR schedule |
511
+ | **RL-ZVP** | 2509.21880 | Zero-variance prompts provide meaningful feedback via entropy-guided shaping | §3.2: our 0% zero-variance is ideal |
512
+ | **GDPO** | 2601.05242 | Decoupled per-reward normalization preserves fine-grained advantage distinctions | §7.4: multi-task reward improvement |
513
+ | **MO-GRPO** | 2509.22047 | GRPO advantage correlates with higher-variance reward components | §7.4, §7.7: task imbalance diagnosis |
514
+ | **Prompt Augmentation** | 2602.03190 | Diverse prompt templates enable longer GRPO training without entropy collapse | §7.6: data augmentation strategy |
515
+ | **Cocktail Effect** | 2410.01109 | 30% general data mixing improves domain performance 2-15% | §7.6: regularization via general data |
516
+ | **Extract-0** | 2509.22906 | 7B + SFT + GRPO → +147% on extraction; GRPO added +13% over SFT alone | §3.3: comparable extraction pipeline |
517
+ | **Gazal-R1** | 2506.21594 | LoRA rank 256 + rsLoRA for SFT+GRPO; DoRA for medical reasoning | §7.5: high LoRA rank for GRPO |
518
+ | **"It Takes Two"** | 2510.00977 | GRPO is secretly DPO as N-vs-M contrastive learning | §7.2, §7.8: more data + larger G = better |
519
+ | **MT-GRPO** | 2602.05547 | Dynamic task weighting prevents collapse to easy tasks | §7.7: task rebalancing |
520
+ | **Dr. GRPO** | 2503.20783 | Remove std normalization, β=0, higher LR | V4 config baseline |
521
+ | **Skywork-OR1** | 2505.22312 | τ=1.0 for exploration, entropy monitoring | V4 config baseline |
522
+ | **MC-GRPO** | 2601.22582 | Median baseline for small rollout budgets | §7.8: G optimization |
523
+
524
+ ---
525
+
526
+ *Assessment generated 2026-04-27. W&B run: `tferrazrafael-self/tucano2-commerce`, run name `grpo-v4-instruct-0.5B`.*