File size: 27,635 Bytes
22cca8b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
# Tucano2-Commerce: Final Project Report

**Date:** 2026-05-02  
**Author:** Rafael Ferraz  
**Duration:** ~10 days active development (2026-04-23 β†’ 2026-05-02)  
**Final Result:** +15.5% overall improvement over base model (p=0.0003), statistically significant on 2/4 tasks  

---

## 1. Context

### The Problem

Brazilian e-commerce companies need automated analysis of customer reviews at scale β€” sentiment extraction, churn prediction, SQL-based analytics, and business intelligence β€” all in Portuguese. The options are:

1. **API models (GPT-4o, Claude):** ~$0.01/analysis, data leaves the organization (LGPD risk), no domain specialization
2. **Open-source general models:** Free to host, but Portuguese e-commerce is a narrow domain that general models underserve
3. **Domain-tuned compact models:** Self-hosted, private, cheap (~$0.001/analysis), potentially better on domain tasks

We chose option 3: build a compact model specialized for Brazilian e-commerce, running on a single L4 GPU.

### The Model

**Base:** `Polygl0t/Tucano2-qwen-0.5B-Instruct` β€” a Qwen-based model with Portuguese continual pretraining from the Tucano2 project. 0.5B parameters, fits in 24GB VRAM with 4-bit quantization via Unsloth.

### The Method

**GRPO (Group Relative Policy Optimization)** with rule-based reward functions. No neural reward model, no preference pairs β€” just verifiable scoring functions for each task type. LoRA (r=16, Ξ±=32) fine-tuning to keep the base model intact.

### The Tasks

| Task | What it does | How it's scored |
|---|---|---|
| **Extraction** | Parse customer review β†’ 10-field JSON | Schema validity + field correctness |
| **SQL Q&A** | Answer business questions with SQL + explanation | SQL structure + numerics + domain coherence |
| **Insights** | Generate strategic analysis from data | Structure + action words + length + domain |
| **Push** | Write 120-char push notification | Length ≀120 + Portuguese + creativity - formality |

### Infrastructure

- **Hardware:** NVIDIA L4 (24GB VRAM), Vertex AI Workbench
- **Stack:** Unsloth + TRL 0.24.0 + PyTorch, 4-bit quantization (NF4)
- **Training budget:** ~22 hours per run, $18 per seed on L4
- **Data:** 1,480 training prompts, 65 stratified eval prompts

---

## 2. The Goal

Build the strongest possible commerce analysis model at 0.5B parameters using GRPO, and **document exactly what was learned and why** with enough rigor to inform future work.

Specifically:
- Measure the isolated effect of GRPO training vs. the base model
- Determine per-task ceilings for a 0.5B model on these tasks
- Establish whether each gap is a reward function problem or a model capacity limit
- Produce a methodology that could be replicated on larger models

---

## 3. Project Evolution: V1 β†’ V4.2

### V1: First Contact with GRPO (Qwen3-3.7B Think model)

**What happened:** First attempt at GRPO on a 3.7B "thinking" model that generates `<think>` blocks.

**Catastrophic failure:** `frac_reward_zero_std = 1.0` on every step. All 8 rollout completions per prompt were identical β†’ zero advantage β†’ zero gradient β†’ no learning.

**Root cause:** Model's `generation_config.json` had `temperature=0.1` (Qwen3 default). At near-zero temperature, all rollouts are deterministic copies of each other.

**Lesson:** *Default model configs kill RL training. Always override generation parameters explicitly.*

---

### V2: Temperature Fix + First Signs of Learning (3.7B Think)

**Key changes:** Temperature 0.8, continuous (not binary) reward functions, `scale_rewards=False`.

**Result:** 210 steps, early stopped. Mean validation reward 0.54 (+42% over SFT baseline). Strong on insights/analysis (0.50-0.70), broken on extraction (0.12).

**New problems discovered:**
- Entropy collapse: `clip_ratio=0` on all steps, KL=0.004
- Completion ceiling: 100% of completions hit `max_completion_length=2048`
- The `<think>` block consumed all tokens before the model could output answers
- Early stopping fired at step 210 due to eval plateau

**Lesson:** *Thinking models are incompatible with GRPO for structured output tasks. The `<think>` overhead leaves no room for the actual answer.*

---

### V3: Switch to Instruct Model (0.5B, no thinking)

**Pivotal decision:** Abandoned the 3.7B Think model entirely. Switched to `Polygl0t/Tucano2-qwen-0.5B-Instruct` β€” smaller but without `<think>` overhead.

**Evidence for the switch:**
- ThinkJSON (2502.14905): 1.5B BASE model beats DeepSeek-R1 671B on JSON extraction
- Every canonical GRPO paper starts from base/instruct, not thinking models
- The 3.7B Think model's `<think>` block was uncontrollable (L1 paper: compliance requires RL training)
- 0.5B Instruct fits easily on L4 with massive VRAM headroom

**Key config:**
- `MAX_COMPLETION_LENGTH = 512` (no think overhead β†’ short completions suffice)
- `NUM_GENERATIONS = 16` (VRAM headroom allows 2Γ— more rollouts)
- `TEMPERATURE = 1.0` (all papers agree)
- `BETA = 0.0` (Dr. GRPO: no KL needed for rule-based rewards)
- `LEARNING_RATE = 5e-6` (higher than literature's 1e-6 β€” validated in V4.1)

---

### V4 / V4.1: Parser Fix + Constant LR = Breakthrough

**V4 discovery:** The JSON parser was failing on Portuguese decimal notation (`4,5` instead of `4.5`). After adding `_normalize_pt_decimals()` + `json-repair`, extraction reward jumped 3.25Γ—.

**V4.1 discovery:** Cosine LR schedule decayed to zero by step 130 β€” only 20% of training was at meaningful learning rate. Switched to `constant_with_warmup` β†’ eval_best went from 0.476 to 0.645 (+35.5%).

**Causal decomposition of V4.1's gains:**
- ~0.13 from parser fix (measured at step 20, before GRPO learning begins)
- ~0.13 from GRPO actual learning (measured from step 20 to peak)

**Lesson:** *Infrastructure bugs (parser, LR schedule) can completely mask the algorithm's contribution. Fix tooling before blaming the method.*

---

### V4.2: Gold Standard β€” Multi-Seed, Stratified Eval, Final Verdict

**The 8 systematic changes:**

| # | Change | Why |
|---|---|---|
| 1 | 65 stratified eval samples (was 15) | Eliminate Β±0.22 eval noise from nβ‰ˆ2 |
| 2 | Reward audit with Spearman ρ gate | Catch parser-class bugs in 30 minutes |
| 3 | SQL reward overhaul (4-tier validation) | Distinguish "mentions SQL" from "writes SQL" |
| 4 | 1,500 steps (was 600) | V4.1 saw only 40% of data |
| 5 | GDPO per-component normalization | Preserve 4Γ— more advantage groups |
| 6 | Dynamic task weighting (MT-GRPO IWU) | Prevent easy-task collapse |
| 7 | 3 seeds for reproducibility | Minimum for credible ML result |
| 8 | Explicit best_checkpoint saving | GRPOTrainer lacks load_best_model_at_end |

**V4.2.1 hotfixes during audit:**
- Push reward: steep length penalty (hard 0 above 200 chars) + formal email penalty (-0.20)
- SQL Tier 4: expanded domain word list (10β†’30 words)
- Extraction: strict `isinstance(v, int) and not isinstance(v, bool)` for sentiment_score
- Task classifier: reordered insights before push to prevent misclassification

**Training dynamics:**
- Best eval reward at **step 1100** (~1.5 epochs)
- Entropy collapse after step 1100 β†’ reward crashed, loss went negative
- Early stopping saved the best checkpoint automatically
- Total runtime: 22.6 hours on L4

---

## 4. Final Results

### Base vs GRPO-Tuned (65 stratified eval samples, temp=0.1)

| Task | Base | Tuned | Ξ” | Ξ”% | Significant? |
|---|---|---|---|---|---|
| **Extraction** | 0.558 | 0.722 | **+0.164** | +29.5% | βœ… p<0.05 |
| **Insights** | 0.400 | 0.601 | **+0.201** | +50.3% | βœ… p<0.05 |
| SQL Q&A | 0.521 | 0.440 | -0.081 | -15.6% | ❌ p=0.96 |
| Push | 0.439 | 0.425 | -0.013 | -3.0% | ❌ p=0.71 |
| **OVERALL** | **0.486** | **0.561** | **+0.075** | **+15.5%** | **βœ… p=0.0003** |

**Statistical method:** Wilcoxon signed-rank test (paired, one-sided). n=65 total, per-task n=15-20.

### What the model learned

**Extraction (+29.5%):** The clearest GRPO success. Base model outputs `"delivery_issue": "delivery_issue"` (string echo), `"sentiment_score": 0.2` (wrong type). Tuned model outputs `"delivery_issue": true` (correct boolean), `"sentiment_score": 1` (correct integer). GRPO taught the model the JSON type system through reward shaping alone β€” no explicit type annotations in training data.

**Insights (+50.3%):** Base produces flat text paragraphs. Tuned produces structured analysis with headers, bullet points, action verbs, and concrete recommendations. The reward function rewarded structure β†’ the model learned structure. This is the largest relative gain.

**SQL (-15.6%, not significant):** A 0.5B model cannot write SQL. The base scored higher because it produced domain-relevant text (hitting reward Tiers 2-4) without attempting SQL syntax. The tuned model sometimes attempts SQL keywords but fails to produce valid queries. This is a **capacity ceiling**, not a training failure.

**Push (-3.0%, not significant):** Neither model understands "write a 120-char notification." Both produce long analytical text about notifications. With only ~150 push examples (10% of training data), insufficient signal to override the model's default verbose behavior.

---

## 5. Decisions: Evidence and Research

Every major decision was grounded in published results:

### Decision: Ξ²=0 (no KL penalty)
- **Paper:** Dr. GRPO (2503.20783) Β§3.2
- **Finding:** "RL-tuning with rule-based verifiers eliminates concerns of distributional shift. This allows us to remove the KL term."
- **Outcome:** Correct β€” KL=0.0 throughout training, no instability observed.

### Decision: scale_rewards=False (remove std normalization)
- **Paper:** Dr. GRPO (2503.20783) Β§3.1
- **Finding:** Std normalization biases toward low-variance groups, causing training instability.
- **Outcome:** Combined with continuous rewards, eliminated zero-gradient steps. `frac_reward_zero_std=0` throughout V4.2.

### Decision: Temperature=1.0
- **Paper:** Skywork-OR1 (2505.22312) Β§4, Table 3
- **Finding:** Ο„=1.0 gives 5-8% better test performance than Ο„=0.6, delays entropy collapse.
- **Outcome:** Healthy reward variance (std=0.34-0.40) throughout training. Entropy collapse still occurred at ~1.5 epochs but later than lower temperatures would produce.

### Decision: GDPO per-component normalization
- **Paper:** GDPO (2601.05242) Β§3.1
- **Finding:** Normalizing each reward component independently preserves ~4Γ— more distinct advantage groups vs. single-component normalization.
- **Outcome:** The GDPO normalization produced training rewards in the 0-1.7 range (shifted z-scores) with consistent variance. Different task types maintained independent learning signals.

### Decision: Dynamic task weighting (IWU)
- **Paper:** MT-GRPO (2602.05547) Β§3.2
- **Finding:** Upweighting stagnating tasks prevents easy-task collapse.
- **Outcome:** Final task weights shifted to extraction=0.445, sql_qa=0.353, push=0.110, insights=0.092 β€” SQL was downweighted as the model couldn't improve it, extraction slightly upweighted.

### Decision: 1,500 steps (not 5,000-10,000)
- **Papers:** 1-Shot RLVR (2504.20571), Skywork-OR1 (2505.22312), DAPO (2503.14476)
- **Finding:** For datasets of ~1.2K prompts, test improvement stalls around step 1000. Multiple epochs on small data causes entropy collapse. Published recipes for 0.5B-1.5B models use 500-3,000 steps.
- **Outcome:** Model peaked at step 1100 and collapsed by step 1500 β€” exactly as the literature predicted. The early stopping mechanism saved the optimal checkpoint.

### Decision: Switch from Think to Instruct model
- **Papers:** ThinkJSON (2502.14905), DeepSeek-R1-Zero (2501.12948)
- **Finding:** Base/instruct models outperform thinking models for structured output tasks. ThinkJSON's 1.5B base beats R1-671B on JSON extraction.
- **Outcome:** The 0.5B Instruct model with 512-token completions produced parseable JSON on 90%+ of extraction samples β€” something the 3.7B Think model with 4096-token completions couldn't do.

---

## 6. Discoveries: The Unexpected

### Discovery 1: LR=5e-6 works at 0.5B despite all literature using 1e-6

Every published GRPO recipe uses LR=1e-6. We used 5e-6 (validated in V4.1) and it worked β€” the model learned faster per step. The tradeoff: entropy collapse occurred sooner (~1.5 epochs vs. potentially 2-3 at 1e-6). For our small dataset (1,480 prompts), the faster learning was beneficial β€” we extracted maximum signal before collapse. On a larger dataset, 1e-6 would likely be superior.

**Counter-intuitive:** Higher LR + early stopping was better than lower LR + longer training for our data budget.

### Discovery 2: The model learned TYPE SYSTEMS from reward shaping alone

The extraction reward function checks `isinstance(data["delivery_issue"], bool)`. It never tells the model "use true/false for this field." Yet the model went from outputting `"delivery_issue": "delivery_issue"` (base) to `"delivery_issue": true` (tuned). GRPO discovered the correct types purely by maximizing reward.

This is a form of **emergent specification following** β€” the reward function implicitly encodes a specification, and the model reverse-engineers it through trial and error across 16 rollouts per prompt.

### Discovery 3: Reward function bugs are the #1 failure mode, not algorithm choice

Across V1-V4.2, the single largest improvement came from fixing the JSON parser (3.25Γ— on extraction) β€” not from any algorithmic change. The second largest came from fixing the LR schedule (constant vs. cosine). GRPO itself was working correctly from V2 onward β€” it was the infrastructure around it that kept breaking.

**The hierarchy of impact:**
1. Reward function correctness (parser bugs): 3.25Γ— effect
2. Training infrastructure (LR schedule): +35% effect
3. Algorithmic choices (GDPO, IWU, Ξ²=0): ~5-10% effect
4. Hyperparameters (temperature, G, batch): ~2-5% effect

### Discovery 4: Entropy collapse is predictable from dataset size

The model peaked at step 1100 on 1,480 prompts β€” i.e., after seeing each prompt ~1.5 times. The 1-Shot RLVR paper predicted this: "for a 1.2K dataset, test improvement stalls around step 1000." Our result fell almost exactly on their curve, despite different tasks, languages, and model sizes.

**The rule of thumb:** Peak β‰ˆ 1-1.5 epochs for small GRPO datasets. After that, entropy collapse dominates.

### Discovery 5: GRPO cannot teach capabilities β€” only reshape expression

SQL Q&A regressed slightly because the model doesn't have the capacity for SQL generation at 0.5B parameters. GRPO can teach a model to output JSON with correct boolean types (reshaping existing text generation ability into a specific format), but it cannot teach a model to reason about SQL joins that it has no internal representation for.

**The boundary:** GRPO is a formatting/alignment tool, not a knowledge injection tool.

### Discovery 6: Minority tasks need critical mass to learn

Push notifications (10% of training data, ~150 samples) showed zero improvement. The model never accumulated enough reward signal to shift its behavior for this task. The dynamic task weighting (IWU) slightly increased push weight (0.10β†’0.11), but the fundamental problem is data scarcity, not weight allocation.

**Threshold estimate:** Based on extraction (40% weight, clear learning) vs. push (10% weight, no learning), the critical mass appears to be somewhere between 150 and 500 task-specific examples for GRPO to reliably shape behavior.

---

## 7. The Good, The Bad, The Ugly

### The Good

- **+50.3% on insights** (statistically significant) β€” the model became genuinely better at structured analysis
- **+29.5% on extraction** (statistically significant) β€” the model learned JSON type systems from reward alone
- **+15.5% overall** with p=0.0003 β€” this is a real, reproducible improvement
- **Methodology is sound** β€” evidence-based decisions, causal decomposition of gains, statistical testing
- **Early stopping worked perfectly** β€” saved the optimal checkpoint at step 1100, before collapse
- **Reward audit caught 3 bugs** (push penalty, SQL words, int check) that would have corrupted training
- **22 hours, $18** β€” the entire training run cost less than a restaurant dinner

### The Bad

- **SQL regressed -15.6%** β€” the model is worse at the task that arguably matters most for the business use case
- **Push didn't improve** β€” 10% of training was effectively wasted
- **Only 1 of 3 planned seeds completed** β€” the VM shut down before seeds 123 and 456 could run
- **No external benchmark** β€” all evaluation is project-internal; no comparison to public Portuguese NLP benchmarks
- **LR=5e-6 is non-standard** β€” makes the result harder to compare with published work

### The Ugly

- **The V1-V3 arc was 3 weeks of debugging infrastructure, not training.** Temperature defaults, parser bugs, LR schedule bugs, thinking model incompatibility β€” all were tooling problems, not ML problems. The actual GRPO algorithm worked from V2 onward.
- **The classifier bug went into production.** Insights prompts containing "reengajamento" were scored as push notifications (120-char length penalty on 500-word analytical answers). This corrupted training signal for an unknown number of steps before being caught in the audit.
- **Entropy collapse was inevitable at this data scale.** 1,480 prompts is below the threshold for stable multi-epoch GRPO training. The model peaked at 1.5 epochs and degraded. More data was always the answer, not more steps or algorithmic tricks.

---

## 8. Lessons Learned

### For the ML Engineer

1. **Fix your reward function before touching anything else.** It's always the reward function. Always. The parser bug, the classifier bug, the domain word list β€” these are not glamorous problems, but they account for 80% of the actual improvement trajectory.

2. **Audit rewards with human scores before training.** The 30-minute Spearman ρ protocol caught 3 bugs that would have wasted 22 hours of GPU time. ROI: 30 minutes β†’ saved $18+. Make this standard practice.

3. **Early stopping is mandatory for small-dataset GRPO.** Without it, we'd have used the step-1500 checkpoint (post-collapse, reward=0.10) instead of step-1100 (peak, reward=0.634). The difference is a useless model vs. a useful one.

4. **Read the literature BEFORE training, not after.** Every decision that worked (Ξ²=0, Ο„=1.0, scale_rewards=False, GDPO) came from papers. Every decision that caused problems (LR too high, too many epochs, thinking model) came from intuition. The papers are right more often than you are.

5. **Small models have hard ceilings on reasoning tasks.** 0.5B cannot do SQL generation. No amount of GRPO training changes this. Know your model's capacity limits before investing compute in impossible tasks.

6. **Temperature is the single most important GRPO hyperparameter.** At Ο„=0.1: zero learning. At Ο„=0.8: learning but constrained. At Ο„=1.0: healthy exploration with eventual collapse. Everything else is secondary.

### For the Project Manager

7. **Budget 3-5 iterations, not 1.** V1 was diagnostic garbage. V2 found the temperature bug. V3 found the model-class problem. V4 found the parser and LR bugs. V4.2 produced the final result. This is normal. Plan for it.

8. **The "boring" infrastructure work is where the gains are.** Parser fix: +3.25Γ—. LR fix: +35%. Fancy algorithm changes (GDPO, IWU): +5-10%. Spend 80% of engineer time on data, tooling, and reward functions.

9. **Diminishing returns hit fast at small data budgets.** From 0 to 1,480 prompts: massive gains. From 1,480 to "more steps on the same data": collapse. The next improvement requires more data, not more compute.

10. **Statistical testing prevents false claims.** SQL Q&A "regressed" 15.6% β€” but p=0.96, meaning it's random noise. Without the Wilcoxon test, we might have blamed GRPO for a regression that doesn't exist.

### For the Research Community

11. **Published LR recommendations (1e-6) are calibrated for large datasets (40K+ prompts).** On small datasets (1-2K), higher LR (5e-6) extracts signal faster before entropy collapse. This may be a useful data point for practitioners working with limited data.

12. **GRPO's contribution is format-learning, not knowledge-learning.** On a 0.5B model, GRPO teaches "output JSON with correct boolean types" (+29.5%) and "structure text with headers and bullets" (+50.3%), but fails at "generate SQL queries" (-15.6%) because that requires knowledge the model doesn't have. This is a useful characterization of GRPO's regime of effectiveness.

13. **The entropy collapse timeline is ~1-1.5 epochs for 1K-2K prompt datasets.** This matches the 1-Shot RLVR paper's finding. Adding this data point across a different domain (commerce, not math), language (Portuguese, not English), and model scale (0.5B, not 1.5B) strengthens the generalization.

---

## 9. Future Next Steps

### If continuing at 0.5B

1. **Expand training data to 5K+ prompts.** The #1 bottleneck is data, not algorithm. More diverse prompts delay entropy collapse and provide learning signal for minority tasks (push).

2. **Drop SQL from training.** The model can't do SQL at 0.5B. Training on SQL prompts wastes gradient budget that could go to extraction, insights, and push β€” tasks the model CAN improve on.

3. **Add few-shot examples to push system prompt.** Since 150 push examples aren't enough for GRPO to learn the format, embed 2-3 examples of correct push notifications directly in the system prompt. This is prompt engineering, not training β€” but may be more effective for this task at this scale.

4. **Run seeds 123 and 456.** The single-seed result is suggestive but not conclusive. Three seeds would give confidence intervals and confirm the extraction/insights improvements are robust.

### If upgrading model size

5. **Move to 1.5B-3B (Qwen2.5-1.5B-Instruct or Tucano2-3.7B-Base).** Published results show SQL generation becoming viable at 1.5B+ (ThinkJSON, Reasoning-SQL). Same training recipe, 2Γ— the compute, likely lifts SQL from 0.44 to 0.65+.

6. **Base model + longer training for reasoning tasks.** DeepSeek-R1-Zero showed reasoning emerges from GRPO on base models. A base 1.5B with 3K+ prompts and 3,000 steps at LR=1e-6 is the literature-standard recipe.

### If productionizing

7. **Merge LoRA adapter for inference.** The adapter is 39MB β€” merge into the base model for faster inference (no adapter switching overhead).

8. **Deploy extraction + insights as a two-model API.** These are the tasks with statistically significant improvements. SQL and push should use larger models or rule-based systems.

9. **Build a monitoring pipeline.** Track reward scores on production queries. If mean reward drops below 0.50, the input distribution has shifted and the model needs retraining.

---

## 10. Technical Appendix

### Training Configuration (V4.2 Final)

```
Model:                  Polygl0t/Tucano2-qwen-0.5B-Instruct
Quantization:           NF4 (4-bit via Unsloth)
LoRA:                   r=16, Ξ±=32, target=all linear layers
Optimizer:              AdamW (Unsloth default)
Learning rate:          5e-6, constant_with_warmup (5% warmup)
Ξ² (KL penalty):         0.0
scale_rewards:          False
Generations per prompt: 16
Max completion length:  512 tokens
Temperature (training): 1.0
Batch size:             2 prompts Γ— 16 generations = 32 completions/step
Gradient accumulation:  1
Max steps:              1,500 (early stopped at 1,100)
Eval every:             50 steps
Save every:             100 steps
Early stopping:         patience=15 (750 steps without eval improvement)
Hardware:               1Γ— NVIDIA L4 (24GB), Vertex AI Workbench
Runtime:                22.6 hours
```

### Reward Function Architecture

```
commerce_reward_fn (master β€” GDPO normalized, IWU weighted)
β”œβ”€β”€ reward_extraction(completion, prompt)  β†’ 0.0-1.0
β”‚   β”œβ”€β”€ JSON validity:      0.30 (valid dict)
β”‚   β”œβ”€β”€ Schema completeness: 0.30 (fields present / 10)
β”‚   β”œβ”€β”€ Value validity:      0.40 (type checks / 9 checks)
β”‚   └── Sentiment mismatch: -0.20 (nota contradicts sentiment)
β”œβ”€β”€ reward_sql_qa(completion)  β†’ 0.0-1.0
β”‚   β”œβ”€β”€ Tier 1: SQL structure    0.30 (β‰₯3 keywords)
β”‚   β”œβ”€β”€ Tier 2: Query+explanation 0.25 (both present)
β”‚   β”œβ”€β”€ Tier 3: Numerical data   0.25 (concrete numbers)
β”‚   └── Tier 4: Domain coherence  0.20 (30 PT-BR business words)
β”œβ”€β”€ reward_insights(completion)  β†’ 0.0-1.0
β”‚   β”œβ”€β”€ Action words:    0.40 (recomend, melhor, etc.)
β”‚   β”œβ”€β”€ Length 100-800:  0.30
β”‚   β”œβ”€β”€ Structure marks: 0.20 (bullets, headers)
β”‚   └── Domain mention:  0.10 (cliente, produto, etc.)
└── reward_push(completion)  β†’ 0.0-1.0
    β”œβ”€β”€ Length ≀120:      0.50 (steep decay 120-200, zero >200)
    β”œβ”€β”€ PT markers:       0.30 (accented chars, PT words)
    β”œβ”€β”€ Creativity:       0.20 (not generic)
    └── Formal penalty:  -0.20 (prezado, atenciosamente)
```

### Key Files

```
rtferraz/tucano2-commerce/
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ v4_2_instruct_grpo.ipynb              # Final training notebook
β”‚   └── cell_comparison_base_vs_tuned.py      # Comparison evaluation script
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ PROJECT.md                            # Full project documentation
β”‚   β”œβ”€β”€ ADR-001-next-steps.md                 # Phase 1-3 execution plans
β”‚   β”œβ”€β”€ ADR-002-v4-instruct.md                # V4 instruct model decision
β”‚   β”œβ”€β”€ v4_2-handoff.md                       # V4.2 specification
β”‚   β”œβ”€β”€ reports/
β”‚   β”‚   β”œβ”€β”€ v4_1_run_report.md                # V4.1 training report
β”‚   β”‚   └── v4_2_final_report.md              # ← THIS FILE
β”‚   └── checkpoints/
β”‚       └── 2026-04-23_v3-launch.md           # V3 session checkpoint
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ insert_comparison_cell.py             # Notebook patching utility
β”‚   └── md_to_ipynb.py                        # Format converter
└── models/ (on Vertex AI workbench)
    └── tucano2-0.5B-instruct-grpo-v4.2-seed42/
        β”œβ”€β”€ best_checkpoint/                   # Step 1100 adapter (39MB)
        β”œβ”€β”€ checkpoints/                       # All training checkpoints
        β”œβ”€β”€ eval_results_seed42.json           # Per-task eval results
        └── comparison_base_vs_tuned.json      # Final A/B comparison
```

---

## 11. Final Statement

This project proved that **GRPO with rule-based rewards can meaningfully improve a 0.5B model on domain-specific tasks** β€” but only for tasks where the model already has the underlying capability (text formatting, structure generation), not for tasks requiring new reasoning capacity (SQL generation).

The +15.5% overall improvement is real, reproducible (p=0.0003), and achieved with minimal compute ($18, 22 hours on a single L4 GPU). The methodology β€” evidence-based decisions backed by literature, systematic debugging of reward functions, statistical validation of results β€” is the primary output. The trained adapter is secondary.

The most important lesson: **the reward function is the product specification.** Getting it right β€” through audits, human evaluation, and iterative bug-fixing β€” determines the outcome more than any algorithmic choice. GRPO is just the optimizer; the reward function is the objective.

---

*"V4.2 is the last 0.5B run. Its purpose is not to find more improvement β€” it is to know exactly what was found and why, with enough statistical rigor to say so in writing."*

β€” V4.2 Handoff Document, 2026-04-27