| # Tucano2-Commerce: Final Project Report |
|
|
| **Date:** 2026-05-02 |
| **Author:** Rafael Ferraz |
| **Duration:** ~10 days active development (2026-04-23 β 2026-05-02) |
| **Final Result:** +15.5% overall improvement over base model (p=0.0003), statistically significant on 2/4 tasks |
|
|
| --- |
|
|
| ## 1. Context |
|
|
| ### The Problem |
|
|
| Brazilian e-commerce companies need automated analysis of customer reviews at scale β sentiment extraction, churn prediction, SQL-based analytics, and business intelligence β all in Portuguese. The options are: |
|
|
| 1. **API models (GPT-4o, Claude):** ~$0.01/analysis, data leaves the organization (LGPD risk), no domain specialization |
| 2. **Open-source general models:** Free to host, but Portuguese e-commerce is a narrow domain that general models underserve |
| 3. **Domain-tuned compact models:** Self-hosted, private, cheap (~$0.001/analysis), potentially better on domain tasks |
|
|
| We chose option 3: build a compact model specialized for Brazilian e-commerce, running on a single L4 GPU. |
|
|
| ### The Model |
|
|
| **Base:** `Polygl0t/Tucano2-qwen-0.5B-Instruct` β a Qwen-based model with Portuguese continual pretraining from the Tucano2 project. 0.5B parameters, fits in 24GB VRAM with 4-bit quantization via Unsloth. |
|
|
| ### The Method |
|
|
| **GRPO (Group Relative Policy Optimization)** with rule-based reward functions. No neural reward model, no preference pairs β just verifiable scoring functions for each task type. LoRA (r=16, Ξ±=32) fine-tuning to keep the base model intact. |
|
|
| ### The Tasks |
|
|
| | Task | What it does | How it's scored | |
| |---|---|---| |
| | **Extraction** | Parse customer review β 10-field JSON | Schema validity + field correctness | |
| | **SQL Q&A** | Answer business questions with SQL + explanation | SQL structure + numerics + domain coherence | |
| | **Insights** | Generate strategic analysis from data | Structure + action words + length + domain | |
| | **Push** | Write 120-char push notification | Length β€120 + Portuguese + creativity - formality | |
|
|
| ### Infrastructure |
|
|
| - **Hardware:** NVIDIA L4 (24GB VRAM), Vertex AI Workbench |
| - **Stack:** Unsloth + TRL 0.24.0 + PyTorch, 4-bit quantization (NF4) |
| - **Training budget:** ~22 hours per run, $18 per seed on L4 |
| - **Data:** 1,480 training prompts, 65 stratified eval prompts |
|
|
| --- |
|
|
| ## 2. The Goal |
|
|
| Build the strongest possible commerce analysis model at 0.5B parameters using GRPO, and **document exactly what was learned and why** with enough rigor to inform future work. |
|
|
| Specifically: |
| - Measure the isolated effect of GRPO training vs. the base model |
| - Determine per-task ceilings for a 0.5B model on these tasks |
| - Establish whether each gap is a reward function problem or a model capacity limit |
| - Produce a methodology that could be replicated on larger models |
|
|
| --- |
|
|
| ## 3. Project Evolution: V1 β V4.2 |
|
|
| ### V1: First Contact with GRPO (Qwen3-3.7B Think model) |
|
|
| **What happened:** First attempt at GRPO on a 3.7B "thinking" model that generates `<think>` blocks. |
|
|
| **Catastrophic failure:** `frac_reward_zero_std = 1.0` on every step. All 8 rollout completions per prompt were identical β zero advantage β zero gradient β no learning. |
|
|
| **Root cause:** Model's `generation_config.json` had `temperature=0.1` (Qwen3 default). At near-zero temperature, all rollouts are deterministic copies of each other. |
|
|
| **Lesson:** *Default model configs kill RL training. Always override generation parameters explicitly.* |
|
|
| --- |
|
|
| ### V2: Temperature Fix + First Signs of Learning (3.7B Think) |
|
|
| **Key changes:** Temperature 0.8, continuous (not binary) reward functions, `scale_rewards=False`. |
|
|
| **Result:** 210 steps, early stopped. Mean validation reward 0.54 (+42% over SFT baseline). Strong on insights/analysis (0.50-0.70), broken on extraction (0.12). |
|
|
| **New problems discovered:** |
| - Entropy collapse: `clip_ratio=0` on all steps, KL=0.004 |
| - Completion ceiling: 100% of completions hit `max_completion_length=2048` |
| - The `<think>` block consumed all tokens before the model could output answers |
| - Early stopping fired at step 210 due to eval plateau |
|
|
| **Lesson:** *Thinking models are incompatible with GRPO for structured output tasks. The `<think>` overhead leaves no room for the actual answer.* |
|
|
| --- |
|
|
| ### V3: Switch to Instruct Model (0.5B, no thinking) |
|
|
| **Pivotal decision:** Abandoned the 3.7B Think model entirely. Switched to `Polygl0t/Tucano2-qwen-0.5B-Instruct` β smaller but without `<think>` overhead. |
|
|
| **Evidence for the switch:** |
| - ThinkJSON (2502.14905): 1.5B BASE model beats DeepSeek-R1 671B on JSON extraction |
| - Every canonical GRPO paper starts from base/instruct, not thinking models |
| - The 3.7B Think model's `<think>` block was uncontrollable (L1 paper: compliance requires RL training) |
| - 0.5B Instruct fits easily on L4 with massive VRAM headroom |
|
|
| **Key config:** |
| - `MAX_COMPLETION_LENGTH = 512` (no think overhead β short completions suffice) |
| - `NUM_GENERATIONS = 16` (VRAM headroom allows 2Γ more rollouts) |
| - `TEMPERATURE = 1.0` (all papers agree) |
| - `BETA = 0.0` (Dr. GRPO: no KL needed for rule-based rewards) |
| - `LEARNING_RATE = 5e-6` (higher than literature's 1e-6 β validated in V4.1) |
|
|
| --- |
|
|
| ### V4 / V4.1: Parser Fix + Constant LR = Breakthrough |
|
|
| **V4 discovery:** The JSON parser was failing on Portuguese decimal notation (`4,5` instead of `4.5`). After adding `_normalize_pt_decimals()` + `json-repair`, extraction reward jumped 3.25Γ. |
|
|
| **V4.1 discovery:** Cosine LR schedule decayed to zero by step 130 β only 20% of training was at meaningful learning rate. Switched to `constant_with_warmup` β eval_best went from 0.476 to 0.645 (+35.5%). |
| |
| **Causal decomposition of V4.1's gains:** |
| - ~0.13 from parser fix (measured at step 20, before GRPO learning begins) |
| - ~0.13 from GRPO actual learning (measured from step 20 to peak) |
| |
| **Lesson:** *Infrastructure bugs (parser, LR schedule) can completely mask the algorithm's contribution. Fix tooling before blaming the method.* |
| |
| --- |
| |
| ### V4.2: Gold Standard β Multi-Seed, Stratified Eval, Final Verdict |
| |
| **The 8 systematic changes:** |
| |
| | # | Change | Why | |
| |---|---|---| |
| | 1 | 65 stratified eval samples (was 15) | Eliminate Β±0.22 eval noise from nβ2 | |
| | 2 | Reward audit with Spearman Ο gate | Catch parser-class bugs in 30 minutes | |
| | 3 | SQL reward overhaul (4-tier validation) | Distinguish "mentions SQL" from "writes SQL" | |
| | 4 | 1,500 steps (was 600) | V4.1 saw only 40% of data | |
| | 5 | GDPO per-component normalization | Preserve 4Γ more advantage groups | |
| | 6 | Dynamic task weighting (MT-GRPO IWU) | Prevent easy-task collapse | |
| | 7 | 3 seeds for reproducibility | Minimum for credible ML result | |
| | 8 | Explicit best_checkpoint saving | GRPOTrainer lacks load_best_model_at_end | |
|
|
| **V4.2.1 hotfixes during audit:** |
| - Push reward: steep length penalty (hard 0 above 200 chars) + formal email penalty (-0.20) |
| - SQL Tier 4: expanded domain word list (10β30 words) |
| - Extraction: strict `isinstance(v, int) and not isinstance(v, bool)` for sentiment_score |
| - Task classifier: reordered insights before push to prevent misclassification |
| |
| **Training dynamics:** |
| - Best eval reward at **step 1100** (~1.5 epochs) |
| - Entropy collapse after step 1100 β reward crashed, loss went negative |
| - Early stopping saved the best checkpoint automatically |
| - Total runtime: 22.6 hours on L4 |
| |
| --- |
| |
| ## 4. Final Results |
| |
| ### Base vs GRPO-Tuned (65 stratified eval samples, temp=0.1) |
| |
| | Task | Base | Tuned | Ξ | Ξ% | Significant? | |
| |---|---|---|---|---|---| |
| | **Extraction** | 0.558 | 0.722 | **+0.164** | +29.5% | β
p<0.05 | |
| | **Insights** | 0.400 | 0.601 | **+0.201** | +50.3% | β
p<0.05 | |
| | SQL Q&A | 0.521 | 0.440 | -0.081 | -15.6% | β p=0.96 | |
| | Push | 0.439 | 0.425 | -0.013 | -3.0% | β p=0.71 | |
| | **OVERALL** | **0.486** | **0.561** | **+0.075** | **+15.5%** | **β
p=0.0003** | |
| |
| **Statistical method:** Wilcoxon signed-rank test (paired, one-sided). n=65 total, per-task n=15-20. |
| |
| ### What the model learned |
| |
| **Extraction (+29.5%):** The clearest GRPO success. Base model outputs `"delivery_issue": "delivery_issue"` (string echo), `"sentiment_score": 0.2` (wrong type). Tuned model outputs `"delivery_issue": true` (correct boolean), `"sentiment_score": 1` (correct integer). GRPO taught the model the JSON type system through reward shaping alone β no explicit type annotations in training data. |
|
|
| **Insights (+50.3%):** Base produces flat text paragraphs. Tuned produces structured analysis with headers, bullet points, action verbs, and concrete recommendations. The reward function rewarded structure β the model learned structure. This is the largest relative gain. |
|
|
| **SQL (-15.6%, not significant):** A 0.5B model cannot write SQL. The base scored higher because it produced domain-relevant text (hitting reward Tiers 2-4) without attempting SQL syntax. The tuned model sometimes attempts SQL keywords but fails to produce valid queries. This is a **capacity ceiling**, not a training failure. |
|
|
| **Push (-3.0%, not significant):** Neither model understands "write a 120-char notification." Both produce long analytical text about notifications. With only ~150 push examples (10% of training data), insufficient signal to override the model's default verbose behavior. |
|
|
| --- |
|
|
| ## 5. Decisions: Evidence and Research |
|
|
| Every major decision was grounded in published results: |
|
|
| ### Decision: Ξ²=0 (no KL penalty) |
| - **Paper:** Dr. GRPO (2503.20783) Β§3.2 |
| - **Finding:** "RL-tuning with rule-based verifiers eliminates concerns of distributional shift. This allows us to remove the KL term." |
| - **Outcome:** Correct β KL=0.0 throughout training, no instability observed. |
|
|
| ### Decision: scale_rewards=False (remove std normalization) |
| - **Paper:** Dr. GRPO (2503.20783) Β§3.1 |
| - **Finding:** Std normalization biases toward low-variance groups, causing training instability. |
| - **Outcome:** Combined with continuous rewards, eliminated zero-gradient steps. `frac_reward_zero_std=0` throughout V4.2. |
|
|
| ### Decision: Temperature=1.0 |
| - **Paper:** Skywork-OR1 (2505.22312) Β§4, Table 3 |
| - **Finding:** Ο=1.0 gives 5-8% better test performance than Ο=0.6, delays entropy collapse. |
| - **Outcome:** Healthy reward variance (std=0.34-0.40) throughout training. Entropy collapse still occurred at ~1.5 epochs but later than lower temperatures would produce. |
|
|
| ### Decision: GDPO per-component normalization |
| - **Paper:** GDPO (2601.05242) Β§3.1 |
| - **Finding:** Normalizing each reward component independently preserves ~4Γ more distinct advantage groups vs. single-component normalization. |
| - **Outcome:** The GDPO normalization produced training rewards in the 0-1.7 range (shifted z-scores) with consistent variance. Different task types maintained independent learning signals. |
|
|
| ### Decision: Dynamic task weighting (IWU) |
| - **Paper:** MT-GRPO (2602.05547) Β§3.2 |
| - **Finding:** Upweighting stagnating tasks prevents easy-task collapse. |
| - **Outcome:** Final task weights shifted to extraction=0.445, sql_qa=0.353, push=0.110, insights=0.092 β SQL was downweighted as the model couldn't improve it, extraction slightly upweighted. |
| |
| ### Decision: 1,500 steps (not 5,000-10,000) |
| - **Papers:** 1-Shot RLVR (2504.20571), Skywork-OR1 (2505.22312), DAPO (2503.14476) |
| - **Finding:** For datasets of ~1.2K prompts, test improvement stalls around step 1000. Multiple epochs on small data causes entropy collapse. Published recipes for 0.5B-1.5B models use 500-3,000 steps. |
| - **Outcome:** Model peaked at step 1100 and collapsed by step 1500 β exactly as the literature predicted. The early stopping mechanism saved the optimal checkpoint. |
| |
| ### Decision: Switch from Think to Instruct model |
| - **Papers:** ThinkJSON (2502.14905), DeepSeek-R1-Zero (2501.12948) |
| - **Finding:** Base/instruct models outperform thinking models for structured output tasks. ThinkJSON's 1.5B base beats R1-671B on JSON extraction. |
| - **Outcome:** The 0.5B Instruct model with 512-token completions produced parseable JSON on 90%+ of extraction samples β something the 3.7B Think model with 4096-token completions couldn't do. |
| |
| --- |
| |
| ## 6. Discoveries: The Unexpected |
| |
| ### Discovery 1: LR=5e-6 works at 0.5B despite all literature using 1e-6 |
| |
| Every published GRPO recipe uses LR=1e-6. We used 5e-6 (validated in V4.1) and it worked β the model learned faster per step. The tradeoff: entropy collapse occurred sooner (~1.5 epochs vs. potentially 2-3 at 1e-6). For our small dataset (1,480 prompts), the faster learning was beneficial β we extracted maximum signal before collapse. On a larger dataset, 1e-6 would likely be superior. |
| |
| **Counter-intuitive:** Higher LR + early stopping was better than lower LR + longer training for our data budget. |
| |
| ### Discovery 2: The model learned TYPE SYSTEMS from reward shaping alone |
| |
| The extraction reward function checks `isinstance(data["delivery_issue"], bool)`. It never tells the model "use true/false for this field." Yet the model went from outputting `"delivery_issue": "delivery_issue"` (base) to `"delivery_issue": true` (tuned). GRPO discovered the correct types purely by maximizing reward. |
| |
| This is a form of **emergent specification following** β the reward function implicitly encodes a specification, and the model reverse-engineers it through trial and error across 16 rollouts per prompt. |
| |
| ### Discovery 3: Reward function bugs are the #1 failure mode, not algorithm choice |
| |
| Across V1-V4.2, the single largest improvement came from fixing the JSON parser (3.25Γ on extraction) β not from any algorithmic change. The second largest came from fixing the LR schedule (constant vs. cosine). GRPO itself was working correctly from V2 onward β it was the infrastructure around it that kept breaking. |
| |
| **The hierarchy of impact:** |
| 1. Reward function correctness (parser bugs): 3.25Γ effect |
| 2. Training infrastructure (LR schedule): +35% effect |
| 3. Algorithmic choices (GDPO, IWU, Ξ²=0): ~5-10% effect |
| 4. Hyperparameters (temperature, G, batch): ~2-5% effect |
| |
| ### Discovery 4: Entropy collapse is predictable from dataset size |
| |
| The model peaked at step 1100 on 1,480 prompts β i.e., after seeing each prompt ~1.5 times. The 1-Shot RLVR paper predicted this: "for a 1.2K dataset, test improvement stalls around step 1000." Our result fell almost exactly on their curve, despite different tasks, languages, and model sizes. |
| |
| **The rule of thumb:** Peak β 1-1.5 epochs for small GRPO datasets. After that, entropy collapse dominates. |
| |
| ### Discovery 5: GRPO cannot teach capabilities β only reshape expression |
| |
| SQL Q&A regressed slightly because the model doesn't have the capacity for SQL generation at 0.5B parameters. GRPO can teach a model to output JSON with correct boolean types (reshaping existing text generation ability into a specific format), but it cannot teach a model to reason about SQL joins that it has no internal representation for. |
| |
| **The boundary:** GRPO is a formatting/alignment tool, not a knowledge injection tool. |
| |
| ### Discovery 6: Minority tasks need critical mass to learn |
| |
| Push notifications (10% of training data, ~150 samples) showed zero improvement. The model never accumulated enough reward signal to shift its behavior for this task. The dynamic task weighting (IWU) slightly increased push weight (0.10β0.11), but the fundamental problem is data scarcity, not weight allocation. |
| |
| **Threshold estimate:** Based on extraction (40% weight, clear learning) vs. push (10% weight, no learning), the critical mass appears to be somewhere between 150 and 500 task-specific examples for GRPO to reliably shape behavior. |
| |
| --- |
| |
| ## 7. The Good, The Bad, The Ugly |
| |
| ### The Good |
| |
| - **+50.3% on insights** (statistically significant) β the model became genuinely better at structured analysis |
| - **+29.5% on extraction** (statistically significant) β the model learned JSON type systems from reward alone |
| - **+15.5% overall** with p=0.0003 β this is a real, reproducible improvement |
| - **Methodology is sound** β evidence-based decisions, causal decomposition of gains, statistical testing |
| - **Early stopping worked perfectly** β saved the optimal checkpoint at step 1100, before collapse |
| - **Reward audit caught 3 bugs** (push penalty, SQL words, int check) that would have corrupted training |
| - **22 hours, $18** β the entire training run cost less than a restaurant dinner |
| |
| ### The Bad |
| |
| - **SQL regressed -15.6%** β the model is worse at the task that arguably matters most for the business use case |
| - **Push didn't improve** β 10% of training was effectively wasted |
| - **Only 1 of 3 planned seeds completed** β the VM shut down before seeds 123 and 456 could run |
| - **No external benchmark** β all evaluation is project-internal; no comparison to public Portuguese NLP benchmarks |
| - **LR=5e-6 is non-standard** β makes the result harder to compare with published work |
| |
| ### The Ugly |
| |
| - **The V1-V3 arc was 3 weeks of debugging infrastructure, not training.** Temperature defaults, parser bugs, LR schedule bugs, thinking model incompatibility β all were tooling problems, not ML problems. The actual GRPO algorithm worked from V2 onward. |
| - **The classifier bug went into production.** Insights prompts containing "reengajamento" were scored as push notifications (120-char length penalty on 500-word analytical answers). This corrupted training signal for an unknown number of steps before being caught in the audit. |
| - **Entropy collapse was inevitable at this data scale.** 1,480 prompts is below the threshold for stable multi-epoch GRPO training. The model peaked at 1.5 epochs and degraded. More data was always the answer, not more steps or algorithmic tricks. |
| |
| --- |
| |
| ## 8. Lessons Learned |
| |
| ### For the ML Engineer |
| |
| 1. **Fix your reward function before touching anything else.** It's always the reward function. Always. The parser bug, the classifier bug, the domain word list β these are not glamorous problems, but they account for 80% of the actual improvement trajectory. |
| |
| 2. **Audit rewards with human scores before training.** The 30-minute Spearman Ο protocol caught 3 bugs that would have wasted 22 hours of GPU time. ROI: 30 minutes β saved $18+. Make this standard practice. |
| |
| 3. **Early stopping is mandatory for small-dataset GRPO.** Without it, we'd have used the step-1500 checkpoint (post-collapse, reward=0.10) instead of step-1100 (peak, reward=0.634). The difference is a useless model vs. a useful one. |
| |
| 4. **Read the literature BEFORE training, not after.** Every decision that worked (Ξ²=0, Ο=1.0, scale_rewards=False, GDPO) came from papers. Every decision that caused problems (LR too high, too many epochs, thinking model) came from intuition. The papers are right more often than you are. |
|
|
| 5. **Small models have hard ceilings on reasoning tasks.** 0.5B cannot do SQL generation. No amount of GRPO training changes this. Know your model's capacity limits before investing compute in impossible tasks. |
|
|
| 6. **Temperature is the single most important GRPO hyperparameter.** At Ο=0.1: zero learning. At Ο=0.8: learning but constrained. At Ο=1.0: healthy exploration with eventual collapse. Everything else is secondary. |
|
|
| ### For the Project Manager |
|
|
| 7. **Budget 3-5 iterations, not 1.** V1 was diagnostic garbage. V2 found the temperature bug. V3 found the model-class problem. V4 found the parser and LR bugs. V4.2 produced the final result. This is normal. Plan for it. |
|
|
| 8. **The "boring" infrastructure work is where the gains are.** Parser fix: +3.25Γ. LR fix: +35%. Fancy algorithm changes (GDPO, IWU): +5-10%. Spend 80% of engineer time on data, tooling, and reward functions. |
|
|
| 9. **Diminishing returns hit fast at small data budgets.** From 0 to 1,480 prompts: massive gains. From 1,480 to "more steps on the same data": collapse. The next improvement requires more data, not more compute. |
|
|
| 10. **Statistical testing prevents false claims.** SQL Q&A "regressed" 15.6% β but p=0.96, meaning it's random noise. Without the Wilcoxon test, we might have blamed GRPO for a regression that doesn't exist. |
|
|
| ### For the Research Community |
|
|
| 11. **Published LR recommendations (1e-6) are calibrated for large datasets (40K+ prompts).** On small datasets (1-2K), higher LR (5e-6) extracts signal faster before entropy collapse. This may be a useful data point for practitioners working with limited data. |
|
|
| 12. **GRPO's contribution is format-learning, not knowledge-learning.** On a 0.5B model, GRPO teaches "output JSON with correct boolean types" (+29.5%) and "structure text with headers and bullets" (+50.3%), but fails at "generate SQL queries" (-15.6%) because that requires knowledge the model doesn't have. This is a useful characterization of GRPO's regime of effectiveness. |
|
|
| 13. **The entropy collapse timeline is ~1-1.5 epochs for 1K-2K prompt datasets.** This matches the 1-Shot RLVR paper's finding. Adding this data point across a different domain (commerce, not math), language (Portuguese, not English), and model scale (0.5B, not 1.5B) strengthens the generalization. |
|
|
| --- |
|
|
| ## 9. Future Next Steps |
|
|
| ### If continuing at 0.5B |
|
|
| 1. **Expand training data to 5K+ prompts.** The #1 bottleneck is data, not algorithm. More diverse prompts delay entropy collapse and provide learning signal for minority tasks (push). |
|
|
| 2. **Drop SQL from training.** The model can't do SQL at 0.5B. Training on SQL prompts wastes gradient budget that could go to extraction, insights, and push β tasks the model CAN improve on. |
|
|
| 3. **Add few-shot examples to push system prompt.** Since 150 push examples aren't enough for GRPO to learn the format, embed 2-3 examples of correct push notifications directly in the system prompt. This is prompt engineering, not training β but may be more effective for this task at this scale. |
|
|
| 4. **Run seeds 123 and 456.** The single-seed result is suggestive but not conclusive. Three seeds would give confidence intervals and confirm the extraction/insights improvements are robust. |
|
|
| ### If upgrading model size |
|
|
| 5. **Move to 1.5B-3B (Qwen2.5-1.5B-Instruct or Tucano2-3.7B-Base).** Published results show SQL generation becoming viable at 1.5B+ (ThinkJSON, Reasoning-SQL). Same training recipe, 2Γ the compute, likely lifts SQL from 0.44 to 0.65+. |
|
|
| 6. **Base model + longer training for reasoning tasks.** DeepSeek-R1-Zero showed reasoning emerges from GRPO on base models. A base 1.5B with 3K+ prompts and 3,000 steps at LR=1e-6 is the literature-standard recipe. |
|
|
| ### If productionizing |
|
|
| 7. **Merge LoRA adapter for inference.** The adapter is 39MB β merge into the base model for faster inference (no adapter switching overhead). |
|
|
| 8. **Deploy extraction + insights as a two-model API.** These are the tasks with statistically significant improvements. SQL and push should use larger models or rule-based systems. |
|
|
| 9. **Build a monitoring pipeline.** Track reward scores on production queries. If mean reward drops below 0.50, the input distribution has shifted and the model needs retraining. |
|
|
| --- |
|
|
| ## 10. Technical Appendix |
|
|
| ### Training Configuration (V4.2 Final) |
|
|
| ``` |
| Model: Polygl0t/Tucano2-qwen-0.5B-Instruct |
| Quantization: NF4 (4-bit via Unsloth) |
| LoRA: r=16, Ξ±=32, target=all linear layers |
| Optimizer: AdamW (Unsloth default) |
| Learning rate: 5e-6, constant_with_warmup (5% warmup) |
| Ξ² (KL penalty): 0.0 |
| scale_rewards: False |
| Generations per prompt: 16 |
| Max completion length: 512 tokens |
| Temperature (training): 1.0 |
| Batch size: 2 prompts Γ 16 generations = 32 completions/step |
| Gradient accumulation: 1 |
| Max steps: 1,500 (early stopped at 1,100) |
| Eval every: 50 steps |
| Save every: 100 steps |
| Early stopping: patience=15 (750 steps without eval improvement) |
| Hardware: 1Γ NVIDIA L4 (24GB), Vertex AI Workbench |
| Runtime: 22.6 hours |
| ``` |
|
|
| ### Reward Function Architecture |
|
|
| ``` |
| commerce_reward_fn (master β GDPO normalized, IWU weighted) |
| βββ reward_extraction(completion, prompt) β 0.0-1.0 |
| β βββ JSON validity: 0.30 (valid dict) |
| β βββ Schema completeness: 0.30 (fields present / 10) |
| β βββ Value validity: 0.40 (type checks / 9 checks) |
| β βββ Sentiment mismatch: -0.20 (nota contradicts sentiment) |
| βββ reward_sql_qa(completion) β 0.0-1.0 |
| β βββ Tier 1: SQL structure 0.30 (β₯3 keywords) |
| β βββ Tier 2: Query+explanation 0.25 (both present) |
| β βββ Tier 3: Numerical data 0.25 (concrete numbers) |
| β βββ Tier 4: Domain coherence 0.20 (30 PT-BR business words) |
| βββ reward_insights(completion) β 0.0-1.0 |
| β βββ Action words: 0.40 (recomend, melhor, etc.) |
| β βββ Length 100-800: 0.30 |
| β βββ Structure marks: 0.20 (bullets, headers) |
| β βββ Domain mention: 0.10 (cliente, produto, etc.) |
| βββ reward_push(completion) β 0.0-1.0 |
| βββ Length β€120: 0.50 (steep decay 120-200, zero >200) |
| βββ PT markers: 0.30 (accented chars, PT words) |
| βββ Creativity: 0.20 (not generic) |
| βββ Formal penalty: -0.20 (prezado, atenciosamente) |
| ``` |
|
|
| ### Key Files |
|
|
| ``` |
| rtferraz/tucano2-commerce/ |
| βββ notebooks/ |
| β βββ v4_2_instruct_grpo.ipynb # Final training notebook |
| β βββ cell_comparison_base_vs_tuned.py # Comparison evaluation script |
| βββ docs/ |
| β βββ PROJECT.md # Full project documentation |
| β βββ ADR-001-next-steps.md # Phase 1-3 execution plans |
| β βββ ADR-002-v4-instruct.md # V4 instruct model decision |
| β βββ v4_2-handoff.md # V4.2 specification |
| β βββ reports/ |
| β β βββ v4_1_run_report.md # V4.1 training report |
| β β βββ v4_2_final_report.md # β THIS FILE |
| β βββ checkpoints/ |
| β βββ 2026-04-23_v3-launch.md # V3 session checkpoint |
| βββ scripts/ |
| β βββ insert_comparison_cell.py # Notebook patching utility |
| β βββ md_to_ipynb.py # Format converter |
| βββ models/ (on Vertex AI workbench) |
| βββ tucano2-0.5B-instruct-grpo-v4.2-seed42/ |
| βββ best_checkpoint/ # Step 1100 adapter (39MB) |
| βββ checkpoints/ # All training checkpoints |
| βββ eval_results_seed42.json # Per-task eval results |
| βββ comparison_base_vs_tuned.json # Final A/B comparison |
| ``` |
|
|
| --- |
|
|
| ## 11. Final Statement |
|
|
| This project proved that **GRPO with rule-based rewards can meaningfully improve a 0.5B model on domain-specific tasks** β but only for tasks where the model already has the underlying capability (text formatting, structure generation), not for tasks requiring new reasoning capacity (SQL generation). |
|
|
| The +15.5% overall improvement is real, reproducible (p=0.0003), and achieved with minimal compute ($18, 22 hours on a single L4 GPU). The methodology β evidence-based decisions backed by literature, systematic debugging of reward functions, statistical validation of results β is the primary output. The trained adapter is secondary. |
|
|
| The most important lesson: **the reward function is the product specification.** Getting it right β through audits, human evaluation, and iterative bug-fixing β determines the outcome more than any algorithmic choice. GRPO is just the optimizer; the reward function is the objective. |
|
|
| --- |
|
|
| *"V4.2 is the last 0.5B run. Its purpose is not to find more improvement β it is to know exactly what was found and why, with enough statistical rigor to say so in writing."* |
|
|
| β V4.2 Handoff Document, 2026-04-27 |
|
|