# Tucano2-Commerce: Comprehensive Project Investigation Report
**Date:** 2026-04-25
**Scope:** Full audit of training performance, identified issues, unexplored alternatives, and actionable recommendations
**Repositories Audited:**
- [`rtferraz/tucano2-commerce`](https://hf.co/rtferraz/tucano2-commerce) – Main project repo (docs, notebooks, scripts)
- [`rtferraz/commerce-model-qwen3.5-lora`](https://hf.co/rtferraz/commerce-model-qwen3.5-lora) – Qwen3.5-9B SFT LoRA adapter
- [`rtferraz/parameter-golf-v2`](https://hf.co/rtferraz/parameter-golf-v2) – Separate competition project (parameter-efficient LM)
---
## Table of Contents
1. [Executive Summary](#1-executive-summary)
2. [Project Architecture & Timeline](#2-project-architecture--timeline)
3. [Every Change That Improved Performance](#3-every-change-that-improved-performance)
4. [Every Issue That Needs Improvement](#4-every-issue-that-needs-improvement)
5. [Every Invaluable Lesson Learned](#5-every-invaluable-lesson-learned)
6. [Every Good Aspect of This Model/Training](#6-every-good-aspect-of-this-modeltraining)
7. [Unexplored Alternatives – What You Haven't Tried Yet](#7-unexplored-alternatives--what-you-havent-tried-yet)
8. [Literature-Backed Recommendations](#8-literature-backed-recommendations)
9. [Risk Assessment](#9-risk-assessment)
10. [Conclusion & Priority Roadmap](#10-conclusion--priority-roadmap)
---
## 1. Executive Summary
The Tucano2-Commerce project aims to build a compact (3.7B parameter) domain-specialized LLM for Brazilian e-commerce analysis (sentiment, JSON extraction, SQL generation, churn prediction, and business insights), all in Portuguese. The pipeline follows the DeepSeek-R1 paradigm: **Base → SFT → GRPO**.
### Key Numbers
| Metric | Value |
|--------|-------|
| Base model | Polygl0t/Tucano2-qwen-3.7B-Think (Qwen3-4B → Portuguese CPT → SFT+Think) |
| SFT data | ~1,650 domain-specific samples |
| GRPO v2 data | 300 prompts (subset) |
| GRPO v3 data | ~1,404 prompts (full) |
| v2 best eval reward | 0.125 (eval) / 0.54 (validation mean) – **+42% over SFT baseline** |
| v2 training steps | 210/300 (early stopped) |
| v2 duration | 14.9 hours on NVIDIA L4 |
| v3 status | Launched (~500 steps, ~25h estimated) |
| Critical issues | Entropy collapse, completion length ceiling, thinking model overhead |
| Hardware | Single NVIDIA L4 (24GB VRAM) |
### Verdict
The project demonstrates strong engineering discipline and research-driven decision-making. The +42% improvement over the SFT baseline is real. However, three structural issues – **entropy collapse, thinking model incompatibility with structured output, and data scale** – are capping performance. The v3 run addresses these partially, but the literature points to several unexplored approaches that could yield substantially better results.
---
## 2. Project Architecture & Timeline
### Model Lineage
```
Qwen/Qwen3-4B-Base
└─ Polygl0t/Tucano2-qwen-3.7B-Base (Portuguese continual pretraining, 320B tok corpus)
   └─ Polygl0t/Tucano2-qwen-3.7B-Think (SFT + thinking training, GigaVerbo-v2)
      └─ YOUR SFT adapter (domain e-commerce, ~1,650 samples)
         ├─ GRPO v1 (first attempt – killed, zero-signal bug)
         ├─ GRPO v2 (210 steps, +42% over SFT)
         └─ GRPO v3 (launched, all fixes from ADR-001)
```
### Separate Project: Qwen3.5-9B LoRA
```
Qwen/Qwen3.5-9B
└─ rtferraz/commerce-model-qwen3.5-lora (LoRA: r=16, α=32, 111MB adapter)
```
This is a separate SFT experiment on a larger model (9B). The adapter config shows standard LoRA targeting all linear layers (q, k, v, o, gate, up, down projections) with r=16, α=32, no dropout. No training metrics or README details were saved – only the default Unsloth template.
### Separate Project: Parameter Golf v2
A competition entry for parameter-efficient language modeling (BPB metric). Uses Int6 GPTQ quantization, SP8192 tokenizer, parallel residual architecture, depth recurrence, Muon optimizer, and TTT (test-time training). Sophisticated work β shows strong systems engineering capability.
---
## 3. Every Change That Improved Performance
### 3.1 Binary → Continuous Reward Functions ✅ (+50% training signal)
| Before | After | Evidence |
|--------|-------|----------|
| Binary rewards (0/1) | Continuous rewards (0.0-1.0) with partial credit | `reward_std=0` dropped from 50% to ~10% of steps |
**Why it worked:** Binary rewards create groups where all completions get 0 or all get 1. With GRPO's group-relative normalization, zero variance → zero advantage → zero gradient. Continuous rewards ensure reward variance exists in nearly every group.
**Paper backing:** Dr. GRPO (2503.20783) §3.1 proves that std-based normalization amplifies this problem – groups with low std get inflated gradient contributions.
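The zero-variance mechanism is easy to see numerically. A minimal sketch (the helper name is illustrative, not project code):

```python
def group_relative_advantage(rewards):
    # GRPO compares completions within one prompt's group:
    # advantage_i = reward_i - mean(group rewards)
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

# Binary reward, all 8 rollouts fail the check: every advantage is 0,
# so the whole group contributes zero gradient.
print(group_relative_advantage([0.0] * 8))  # all zeros
# Continuous partial credit keeps reward variance (and thus gradient) alive.
print(group_relative_advantage([0.2, 0.4, 0.9, 0.4, 0.6, 0.3, 0.7, 0.5]))
```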
### 3.2 Temperature 0.1 → 0.8 ✅ (Training went from non-functional to functional)
| Before | After | Evidence |
|--------|-------|----------|
| temp=0.1 (Qwen3 default in `generation_config.json`) | temp=0.8 | `frac_reward_zero_std` went from **1.0** (every step) to **~0.0** |
**Why it worked:** Low temperature makes all G=8 rollouts near-identical → zero reward variance → zero advantage → zero gradient. This was the single most destructive bug – the entire v1 run produced **zero learning signal**.
**Paper backing:** Skywork-OR1 (2505.22312) §3.1: τ=1.0 "enhances exploration capability and improves learning plasticity." Their ablation (§3.2.4) shows τ=0.6 immediately enters a low-entropy state.
### 3.3 `scale_rewards=False` (Dr. GRPO fix) ✅
| Before | After | Evidence |
|--------|-------|----------|
| Default GRPO std normalization | Removed std normalization | More stable training; eliminated most zero-gradient steps |
**Why it worked:** Standard GRPO divides advantages by std(rewards) per group. When a group has near-uniform rewards, the small std inflates gradients → training instability + bias toward "easy" prompts (the difficulty bias).
**Paper backing:** Dr. GRPO (2503.20783) §3.1 formally proves this bias and shows removing it achieves SOTA 43.3% on AIME 2024 with a 7B model.
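The difficulty bias can be reproduced with toy numbers. A sketch of the per-group formula (`advantages` is an illustrative helper):

```python
def advantages(rewards, scale_rewards=True):
    # GRPO advantage: reward minus group mean, optionally divided by group std.
    n = len(rewards)
    mu = sum(rewards) / n
    adv = [r - mu for r in rewards]
    if scale_rewards:
        std = (sum(a * a for a in adv) / n) ** 0.5
        if std > 1e-8:
            adv = [a / std for a in adv]
    return adv

easy = [0.99, 1.0, 1.0, 1.0]  # near-uniform "easy" group, std ~ 0.004
# With std scaling, the easy group's tiny differences inflate to O(1):
print(max(abs(a) for a in advantages(easy)))                       # ~1.73
# With scale_rewards=False the group stays near zero, as it should:
print(max(abs(a) for a in advantages(easy, scale_rewards=False)))  # ~0.0075
```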
### 3.4 EVAL_MAX_TOKENS 256 → 2048 ✅ (Prevented premature early stopping)
| Before | After | Evidence |
|--------|-------|----------|
| 256 eval tokens | 2048 eval tokens | Training ran to 210 steps vs. killed at step 40 |
**Why it worked:** The Think model needs 500-700+ tokens just to reach `</think>`. At 256 tokens, eval always scored incomplete generations → flat eval metrics → early stopping fired after 3 evals → killed training at step 40.
### 3.5 Early Stopping Patience 3 → 10 ✅
| Before | After | Evidence |
|--------|-------|----------|
| 3 consecutive evals | 10 consecutive evals | 100 steps of runway before halt (was 30) |
**Why it worked:** GRPO training is noisy – reward doesn't monotonically improve. Patience=3 was too aggressive.
### 3.6 `UnslothGRPOTrainer` Wrapper ✅ (~2-3× generation speedup)
Wraps `_generate()` with `for_inference()`/`for_training()` to activate Unsloth's optimized Triton kernels during generation. Without this: ~3-4 tok/s. With: ~8-15 tok/s on L4.
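The toggle pattern can be sketched generically. Below, `for_inference`/`for_training` are stand-ins for Unsloth's `FastLanguageModel.for_inference`/`for_training` hooks, and `BaseTrainer` stands in for the real trainer – only the wrapping pattern is shown:

```python
calls = []

def for_inference(model):   # stand-in for FastLanguageModel.for_inference
    calls.append("inference")

def for_training(model):    # stand-in for FastLanguageModel.for_training
    calls.append("training")

class BaseTrainer:          # stand-in for TRL's GRPOTrainer
    model = None
    def _generate(self, prompt):
        calls.append("generate")
        return prompt.upper()

class UnslothStyleTrainer(BaseTrainer):
    def _generate(self, prompt):
        # Fast kernels only while sampling rollouts; always restore the
        # gradient-ready state even if generation raises.
        for_inference(self.model)
        try:
            return super()._generate(prompt)
        finally:
            for_training(self.model)

print(UnslothStyleTrainer()._generate("hi"))  # HI
print(calls)  # ['inference', 'generate', 'training']
```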
### 3.7 `processing_class=tokenizer` Fix ✅
In TRL 0.24.0, passing `tokenizer=tokenizer` to GRPOTrainer was silently dropped. Changed to `processing_class=tokenizer`. Without this fix, the eval callback received `None` as tokenizer.
### 3.8 Reward Normalization (extraction reward capped to 1.0) ✅
The extraction reward function originally scored up to 2.0 while others maxed at 1.0 → extraction gradients were 2× larger → biased optimization toward extraction at the expense of other tasks.
**Paper backing:** MO-GRPO (2509.22047) Theorem 1 proves that GRPO advantages are more correlated with higher-variance reward components. Unnormalized rewards cause exactly this.
### 3.9 v3 Changes (Launched, Awaiting Results)
| Change | From | To | Paper |
|--------|------|----|-------|
| Temperature | 0.8 | 1.0 | Skywork-OR1 |
| max_completion_length | 2048 | 4096 | Dr. GRPO |
| num_generations | 8 | 4 | MC-GRPO (VRAM tradeoff) |
| learning_rate | 5e-7 | 2e-6 | Dr. GRPO Appendix G |
| β (KL penalty) | implicit | 0.0 | Dr. GRPO §3.2 |
| Training data | 300 subset | ~1,400 (all) | Scale fix |
| System prompts | generic | 4 task-aware | OptimalThinkingBench |
| Think efficiency reward | none | `reward_think_efficiency()` | L1 paper |
| Zero-advantage groups | included | noise injection (σ=0.005) | Skywork-OR1 |
| grad_accum | 2 | 1 | Effective batch 4 |
---
## 4. Every Issue That Needs Improvement
### 4.1 🔴 CRITICAL: Entropy Collapse (clip_ratio=0 on ALL steps)
**Evidence:** v2 logs show `clip_ratio=0` on every single training step. KL divergence = 0.004. The policy barely moved from the SFT initialization.
**What this means:** The PPO clipping mechanism is designed to prevent the policy from moving too far. But clip_ratio=0 means the policy **never even approached** the clipping boundary – it's not that clipping is preventing movement; the policy has no gradient signal large enough to push it there.
**Root cause analysis (with paper evidence):**
1. **DAPO (2503.14476) §3.1 – Clip-Higher:** Standard PPO clips at [1-ε, 1+ε]. For low-probability "exploration" tokens (p=0.01), the upper bound is only 0.012 – the token can barely increase its probability. Meanwhile, high-probability "exploitation" tokens (p=0.9) can go to 1.08. This asymmetry means the upper clip restricts exploration far more than exploitation. DAPO proposes **decoupled clip** with ε_low=0.2, ε_high=0.28 – a wider upper clip to encourage exploration.
2. **Skywork-OR1 (2505.22312) §4:** On-policy training significantly slows entropy collapse. Off-policy updates (multiple gradient steps per rollout) accelerate it. Your current setup does 1 gradient step per rollout (on-policy) – this is correct but insufficient without the entropy bonus.
3. **EDGE-GRPO (2507.21848):** Even with temperature=1.0, models can still collapse to near-deterministic output. The paper proposes Entropy-Driven Advantage (EDA) – dividing advantages by normalized per-response entropy, which amplifies the advantage of diverse responses.
**How to fix:**
- **Add explicit entropy bonus to loss** (Skywork-OR1 MAGIC loss, Eq. 3.1): `α_k * H_ij^t(θ)` where α starts at 5e-3 and decays. This requires modifying the loss function.
- **Implement DAPO's Clip-Higher:** Set ε_low=0.2, ε_high=0.28 (or even higher). This is a TRL config change if supported, or requires a trainer subclass.
- **Filter zero-advantage groups** completely (Skywork-OR1 §3.1), not just add noise. Remove entire prompts where all G completions get identical rewards.
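The entropy-bonus fix can be sketched in a few lines (assumptions: the 5e-3 starting coefficient comes from this report, the linear decay schedule is illustrative, and the per-response entropy term is approximated by the mean token entropy):

```python
import math

def token_entropy(logits):
    # Shannon entropy of the softmax distribution at one token position.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)

def loss_with_entropy_bonus(pg_loss, per_token_logits, alpha):
    # Subtracting alpha * mean entropy rewards diverse output distributions,
    # counteracting collapse toward deterministic generation.
    h = sum(token_entropy(l) for l in per_token_logits) / len(per_token_logits)
    return pg_loss - alpha * h

def alpha_schedule(step, total_steps, alpha0=5e-3):
    # Linear decay from alpha0 to 0 (one possible schedule, an assumption).
    return alpha0 * max(0.0, 1.0 - step / total_steps)
```

In a trainer subclass this term would be added where the policy-gradient loss is computed; newer TRL exposes a native `entropy_coeff` knob (see §4.6), which removes the need for the subclass once the version lock is lifted.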
### 4.2 🔴 CRITICAL: Thinking Model Incompatibility with Structured Output
**Evidence:**
- v2 calibration: 8/8 samples hit 2048 ceiling
- v3 calibration (temp=0.7): 8/8 samples hit 4096 ceiling, both extraction samples stuck in `<think>`
- Prompt-level control ("Não pense em excesso" – "don't overthink") had **zero measurable effect** at inference time
- L1 paper (2503.04697) confirms: untrained models ignore length instructions
**Root cause:** The Think model's chat template **always injects `<think>`** on the last assistant turn – there is no `enable_thinking` conditional (unlike official Qwen3-4B). The model was trained to think extensively, and this behavior is deeply embedded in its weights.
**Why this matters for extraction/SQL:**
- Extraction needs ~50-100 tokens of output (JSON). The model produces 2000-3000 tokens of `<think>` first.
- At temp=0.1 (inference), the model deterministically fills the entire context with thinking.
- At temp=1.0 (training), completions are shorter (358-528 tokens avg) – but this creates a **train-test distribution mismatch**.
**How to fix:**
1. **Switch to Base model** – `Polygl0t/Tucano2-qwen-3.7B-Base`. Every canonical GRPO paper starts from base/instruct, not thinking models. DeepSeek-R1-Zero proved thinking emerges from RL. ThinkJSON (2502.14905) beats R1-671B on JSON extraction using Qwen2.5-1.5B-Base + GRPO. This requires re-running SFT (LoRA adapters are model-specific).
2. **Hybrid deployment** – Use the Think model for insights (where thinking adds value) and the Base model for extraction/SQL/push (where thinking hurts).
3. **Modify chat template** – Fork the template to conditionally disable `<think>` injection for extraction/push tasks. This is a workaround, not a fix.
### 4.3 🟡 MODERATE: Data Scale (300 → 1,400, Still Below Literature Minimum)
**Evidence:**
- v2: 300 prompts – early stopping at step 210 (70% of one epoch)
- v3: ~1,400 prompts – 500 steps planned
- Literature minimum: Skywork-OR1 uses 30K+ prompts. DeepSeek-R1 uses 600K+.
**Why this matters:** With 1,400 prompts, the model sees each prompt only once. There's no second-epoch reinforcement. The reward signal is thin – each task type has only 100-650 examples.
**How to fix:**
- **Synthetic data augmentation** using GPT-4o or the SFT model itself (planned in ADR-001)
- **Data mixing with general reasoning** – the Cocktail Effect paper (2410.01109) shows 30% general data improves domain performance by 2-15%
- Target **5,000+ prompts** for meaningful multi-epoch training
### 4.4 🟡 MODERATE: Multi-Task Reward Interference
**Evidence:**
- Bimodal performance: insights/analysis (0.50-0.70) vs. extraction (0.12)
- Extraction reward was previously 2Γ the scale of other rewards (fixed in v2)
- v3 uses single composite reward summing all components
**Root cause (paper evidence):**
1. **MO-GRPO (2509.22047) Theorem 1:** In standard GRPO, the advantage function is more correlated with reward components that have higher variance. If `reward_insights` has variance 0.1 but `reward_extraction` has variance 0.01 (because extraction either works or doesn't), GRPO will preferentially optimize for insights.
2. **GDPO (2601.05242) §3.1:** When GRPO sums multiple rewards before normalization, distinct reward combinations can map to identical advantages – losing information. E.g., (format=0, content=1) and (format=1, content=0) both sum to 1 → same advantage, despite being completely different errors.
3. **Multi-Task GRPO (2602.05547) §3:** Standard average reward maximization can allow large gains on easy tasks to compensate for stagnation on hard tasks. Their formulation explicitly bounds inter-task performance disparity.
**How to fix:**
- **GDPO:** Normalize each reward component separately before summing. This preserves fine-grained advantage distinctions.
- **Multi-Task GRPO:** Dynamic task weighting that upweights underperforming tasks (extraction) and downweights saturating tasks (insights).
- **Conditional rewards:** Gate easier rewards (format) on harder ones (content accuracy). The model only receives the format reward if content is above a threshold (GDPO §3.2, Eq. 8).
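The conditional-reward idea can be sketched directly (a minimal sketch; the threshold and function names are illustrative, not GDPO's exact Eq. 8):

```python
def gated_reward(content_score, format_score, content_gate=0.5):
    # Gate the easy reward on the hard one: format credit is granted only
    # when content quality clears the threshold, so the policy cannot
    # farm format points while the content is wrong.
    if content_score >= content_gate:
        return content_score + format_score
    return content_score  # format reward withheld

# Well-formed-but-wrong no longer scores like malformed-but-right:
print(gated_reward(content_score=0.0, format_score=1.0))  # 0.0
print(gated_reward(content_score=0.8, format_score=1.0))  # 1.8
```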
### 4.5 🟡 MODERATE: No Formal Benchmark
**Evidence:** Evaluation uses 5 held-out prompts scored by the reward function itself. There's no independent benchmark with ground truth, no comparison against baselines (Qwen3-3.7B base, GPT-4o), no standardized metrics.
**How to fix:** Phase 1 of ADR-001 is well-designed (80 prompts, per-task scorers, multiple baselines). Execute it.
### 4.6 🟢 MINOR: TRL 0.24.0 Lock
**Evidence:** Pinned to TRL 0.24.0 for Unsloth compatibility. Newer TRL versions have:
- Native `entropy_coeff` in GRPOConfig
- Better logging (clip ratios per-positive/negative)
- Bug fixes for generation config handling
**How to fix:** Either upgrade Unsloth or implement needed features via callbacks/trainer subclass (v3 already does this for entropy monitoring).
### 4.7 🟢 MINOR: Single GPU Training Bottleneck
**Evidence:**
- Smoke test: 318s/step → 13.2h for 75 steps
- v2 full run: ~4.3 min/step → 14.9h for 210 steps
- v3 estimated: ~3 min/step → 25h for 500 steps
With G=4 and max_completion_length=4096, generation dominates training time. vLLM was available but not used (`USE_VLLM=False`).
**How to fix:**
- Enable vLLM colocate mode for faster generation
- Consider multi-GPU setup (2ΓL4 or A100) for generation parallelism
---
## 5. Every Invaluable Lesson Learned
### 5.1 Technical Lessons
1. **Default model generation configs will silently destroy your RL training.** Qwen3's `generation_config.json` sets `temperature=0.1`. This single default was responsible for the complete failure of v1. **Always explicitly override** every generation parameter.
2. **The reward function is the product specification.** Binary rewards → zero signal. Continuous rewards with partial credit → training works. Multi-component rewards with staged convergence → format learns first, content follows. The time spent designing rewards is the most valuable engineering time.
3. **GRPO needs diversity to learn – diversity in completions AND diversity in prompts.** Low temperature → identical completions → zero advantage. Few prompts → memorization → entropy collapse. Short completion budget → truncation → reward ceiling. All three destroy the algorithm's fundamental mechanism: *comparing different outcomes to the same prompt*.
4. **TRL's step calculation includes a `num_generations` multiplier.** `steps = num_prompts × num_generations / (batch_size × grad_accum)`. Missing this gives wrong epoch estimates. `MAX_STEPS` always overrides `NUM_EPOCHS`.
5. **Early stopping parameters must match the model's output characteristics.** A thinking model needs 500+ tokens to reach `</think>`. Evaluating at 256 tokens scores incomplete generations → flat metrics → premature stop.
6. **Entropy collapse is the GRPO failure mode – not divergence, not reward hacking.** The model collapses to deterministic output. Monitoring `clip_ratio` and generation entropy is more important than monitoring reward.
7. **Calibration at inference temperature ≠ training behavior.** Calibrating at temp=0.7 showed catastrophic results (100% ceiling hits). But actual training at temp=1.0 showed healthy dynamics (358-528 token avg, 0% ceiling). Future calibration must include a temp=1.0 pass.
8. **LoRA adapters are model-specific.** Adapters can't transfer from the Think to the Base model. Switching base model requires re-running SFT from scratch.
9. **Thinking models and structured output tasks are fundamentally in tension** when completion budgets are constrained. The `<think>` block consumes tokens that the task output needs.
10. **KV cache correctness matters.** The diagnostic cell (5b) correctly identified that KV cache was working (ratio 0.7×). Had it been broken (>5×), generation would have been catastrophically slow.
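Lesson 4's step formula can be sanity-checked against this report's own v2 numbers (300 prompts, G=8, grad_accum=2; the per-device batch of 4 is inferred from the v3 "effective batch 4" note, so treat it as an assumption):

```python
def grpo_steps_per_epoch(num_prompts, num_generations, batch_size, grad_accum):
    # TRL counts one sample per generation, so the prompt count is
    # multiplied by num_generations before dividing by the effective batch.
    return num_prompts * num_generations // (batch_size * grad_accum)

# v2: 300 steps per epoch, consistent with "step 210 = 70% of one epoch".
print(grpo_steps_per_epoch(300, 8, 4, 2))  # 300
```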
### 5.2 Process Lessons
11. **Budget 3-5 iterations, not 1.** v1 found the zero-signal bug. v2 found the temperature bug and completion ceiling. v3 addresses entropy collapse. Each iteration is cheaper because you know what to measure.
12. **Literature crawl before implementation saves compute.** The research found 6 papers on thinking control, Dr. GRPO's bias fixes, Skywork-OR1's entropy analysis, and the entire GRPO variant ecosystem – all directly applicable. Without this, you'd discover these issues empirically at $2/GPU-hour.
13. **The model family tree matters.** Discovering that `Tucano2-Think` descends from `Tucano2-Base` (itself from `Qwen3-4B-Base`) surfaced a clean non-thinking alternative with Portuguese preserved.
14. **Log everything from the start.** Moving W&B init to the beginning of the notebook means even preflight checks survive kernel disconnections.
15. **Documentation is debugging.** The project has excellent documentation (PROJECT.md, ADR-001, checkpoint logs, v3 patch spec). This made the entire investigation possible. Without docs, understanding 14.9 hours of training would require reading raw W&B logs.
### 5.3 Business Lessons
16. **Domain data is the moat, not model size.** ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. The 42% improvement from domain GRPO on 300 examples validates this thesis.
17. **Self-hosting economics are immediately favorable.** $0.001/analysis (GPU) vs $0.01+ (API). Breakeven at ~100 analyses/day.
18. **Portuguese-first is a defensible advantage.** Most LLM development is English-first. A model that understands Brazilian e-commerce Portuguese ("veio com defeito" – "it came defective"; "nota 1 estrela" – "1-star rating") has a competitive moat.
---
## 6. Every Good Aspect of This Model/Training
### 6.1 Architecture Decisions
✅ **Correct pipeline choice (SFT → GRPO).** The DeepSeek-R1 paradigm is validated by multiple papers and is the right approach for rule-based reward domains.
✅ **Correct base model selection.** Qwen3-4B with Portuguese continual pretraining (Tucano2) is arguably the best available foundation for this task size. The Tucano2 paper (2603.03543) shows it achieves SOTA on Portuguese benchmarks. Using a Portuguese-specialized model instead of vanilla Qwen3 is the right call.
✅ **Rule-based rewards over neural reward model.** For structured tasks with verifiable outputs (JSON schema, SQL execution), rule-based rewards are objectively superior. DeepSeek-R1 demonstrated this. Neural reward models at this scale would introduce reward hacking.
✅ **4-bit quantization (NF4) via Unsloth.** Enables a 3.7B model to fit in 24GB VRAM with headroom. The VRAM budget analysis (Cell 9 smoke test) confirmed 6.8GB/23.6GB peak – massive headroom.
✅ **LoRA over full fine-tuning for SFT.** With only 1,650 training samples, full fine-tuning would overfit. LoRA (r=16, α=32, 33M/3.8B trainable params = 0.87%) is appropriate.
### 6.2 Engineering Practices
✅ **Gated cell execution (Cells 1-13).** Each cell is a verification gate – verify output before proceeding. This prevents cascading failures.
✅ **Comprehensive diagnostic cells.** KV cache test (5b), inference test (5), reward calibration (7), smoke test (9), probe run (10) – all before committing to the full run. This is excellent practice.
✅ **Weight drift validation (Cell 11 safety checks).** Testing 50 merge/unmerge cycles for LoRA weight drift, memory leak detection, and gradient flow verification. No other project I've audited does this.
✅ **`UNSLOTH_COMPILE_DISABLE=1`.** Prevents Triton kernel recompilation on every `for_inference()`/`for_training()` switch. This shows understanding of Unsloth internals.
✅ **Proper checkpoint management.** `save_steps=10-15`, `save_total_limit=3-5`, `save_only_model=True` – efficient disk usage with enough coverage for Spot VM preemption recovery.
✅ **Multi-task reward design.** Separate reward functions for extraction, SQL, insights, and push notifications – each with domain-specific heuristics. The extraction reward scores 10 individual JSON fields with appropriate validators.
### 6.3 Research Methodology
✅ **Every decision is paper-backed.** Dr. GRPO for std normalization. Skywork-OR1 for temperature. MC-GRPO for group size. ThinkJSON for the domain specialization thesis. This is research-grade engineering.
✅ **Proactive issue diagnosis.** The project identified entropy collapse, completion ceiling, and data scale as root causes – not just symptoms. The analysis correctly attributes clip_ratio=0 to entropy collapse (not insufficient learning rate or a wrong reward function).
✅ **Clear documentation with decision log.** PROJECT.md has a formal decision log with context, problem, decision, consequence, and reference for every choice. This is ADR (Architecture Decision Record) quality.
### 6.4 Training Results
✅ **+42% over SFT baseline is significant.** Going from 0.38 (SFT calibration) to 0.54 (GRPO v2 validation mean) demonstrates that GRPO is providing real value, even with all the issues.
✅ **Bimodal performance reveals the problem structure.** The fact that insights/analysis (0.50-0.70) work well while extraction (0.12) doesn't tells you exactly where to focus: structured output + thinking model = the bottleneck.
✅ **Zero frac_reward_zero_std after v2 fixes.** The reward engineering is correct – every group now has reward variance. The remaining issue is that advantages are too small to overcome the clip boundary.
---
## 7. Unexplored Alternatives – What You Haven't Tried Yet
### 7.1 🔴 Base Model GRPO (Highest Expected Impact)
**What:** Train GRPO starting from `Polygl0t/Tucano2-qwen-3.7B-Base` instead of `-Think`.
**Why it's unexplored:** The project committed to the Think model early and hasn't tested the Base alternative.
**Literature evidence:**
- **DeepSeek-R1-Zero:** Proved that thinking/reasoning *emerges* from RL training on base models – you don't need a pre-trained thinker.
- **ThinkJSON (2502.14905):** Qwen2.5-1.5B-**Base** + GRPO beats DeepSeek-R1-671B on JSON extraction. Base model = no `<think>` overhead = more tokens for actual output.
- **Reasoning-SQL (2503.23157):** 7B base model + GRPO beats o3-mini on SQL.
- **Your own analysis (checkpoint log):** "Every canonical GRPO paper starts from base/instruct, not thinking models."
**Expected impact:**
- Extraction score: 0.12 → 0.50+ (elimination of `<think>` overhead = JSON fits in the completion budget)
- Completion efficiency: 3000 → 200-500 tokens for extraction
- Training speed: ~2× faster (shorter completions)
**Cost:** Requires re-running SFT (~2-4 hours on L4), then GRPO (~25 hours).
### 7.2 🔴 DAPO's Decoupled Clip (Directly Addresses Entropy Collapse)
**What:** Replace the symmetric clip [1-ε, 1+ε] with an asymmetric [1-ε_low, 1+ε_high] where ε_high > ε_low.
**Why it's unexplored:** Not available in TRL 0.24.0 as a config option. Requires trainer subclass modification.
**Literature evidence:**
- **DAPO (2503.14476) §3.1:** Standard symmetric clipping restricts low-probability exploration tokens far more than high-probability exploitation tokens. A token with p=0.01 can only reach 0.012, while p=0.9 can reach 1.08. Decoupled clip with ε_low=0.2, ε_high=0.28 specifically allows exploration tokens to increase more.
- **Tricks or Traps (2508.08221) §4.2:** Independently verifies that Clip-Higher is one of the most impactful single techniques for preventing entropy collapse. Their "Lite PPO" achieves strong results with just the normalization fix + Clip-Higher.
- **Your symptom matches exactly:** clip_ratio=0 means no tokens reach the clip boundary in either direction; once updates do grow, a wider upper bound gives exploration tokens room to move instead of capping them immediately.
**Expected impact:** Non-zero clip_ratio → actual policy movement → real learning signal.
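The decoupled clip is a one-line change to the per-token surrogate. A scalar sketch (illustrative function name, not TRL's internals):

```python
def dapo_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # PPO-style clipped surrogate with DAPO's decoupled bounds: the upper
    # bound 1 + eps_high > 1 + eps_low gives low-probability exploration
    # tokens more headroom to grow before the update is capped.
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped * advantage)

# Exploration token whose probability doubled (ratio 2.0, positive advantage):
print(dapo_token_objective(2.0, 1.0))                # capped at 1 + 0.28
print(dapo_token_objective(2.0, 1.0, eps_high=0.2))  # symmetric PPO cap, smaller
```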
### 7.3 🟡 GDPO for Multi-Task Rewards (Fixes Reward Interference)
**What:** Normalize each reward component separately before summing, then apply batch-wise normalization.
**Why it's unexplored:** Your current approach sums all reward components into a single scalar before GRPO's group normalization.
**Literature evidence:**
- **GDPO (2601.05242) §3.1:** When summing K rewards before normalization, distinct reward combinations collapse to identical advantages. With 4 tasks × 4 reward components, you're losing substantial gradient information.
- **MO-GRPO (2509.22047):** Proves (Theorem 1) that advantage correlation with each reward component is proportional to that component's standard deviation. Higher-variance rewards dominate, regardless of importance.
**Implementation:** For each prompt group, normalize each of the 4 task-specific rewards independently, then sum the normalized advantages, then apply batch-level normalization.
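That recipe can be sketched as follows (illustrative helper names; the two-component example generalizes to the 4 task-specific rewards):

```python
def normalize(xs):
    # z-score within a group; zero vector when the group is uniform
    n = len(xs)
    mu = sum(xs) / n
    std = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return [0.0 if std < 1e-8 else (x - mu) / std for x in xs]

def gdpo_style_advantages(component_rewards):
    # Normalize each reward component across the group *before* summing,
    # then apply a final normalization over the summed advantages.
    per_component = [normalize(comp) for comp in component_rewards]
    n = len(per_component[0])
    summed = [sum(comp[i] for comp in per_component) for i in range(n)]
    return normalize(summed)

# Format and content rewards for 4 completions. Completions 0 and 1 make
# different errors but have identical sums, so sum-then-normalize cannot
# tell them apart, while per-component normalization can:
fmt, content = [0, 1, 1, 0], [1, 0, 0, 0]
print(normalize([f + c for f, c in zip(fmt, content)])[:2])  # identical pair
print(gdpo_style_advantages([fmt, content])[:2])             # now distinct
```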
### 7.4 🟡 Multi-Task GRPO Dynamic Weighting (Fixes Task Imbalance)
**What:** Dynamically upweight underperforming tasks (extraction) during training.
**Why it's unexplored:** Current approach uses fixed stratified sampling (40% extraction, 40% SQL, 10% insights, 10% push) but equal reward weighting.
**Literature evidence:**
- **MT-GRPO (2602.05547):** Proposes improvement-aware weight update (IWU) that tracks per-task reward *improvement rates* and upweights tasks that are stagnating. Avoids the collapse-to-worst-task problem of naive minimax.
- **Key insight:** Use true task-level rewards for weight updates, not the GRPO loss (which is ambiguous – zero loss could mean all-correct or all-incorrect).
### 7.5 🟡 Blockwise Advantage Estimation (For Structured Multi-Part Output)
**What:** Assign separate advantages to different parts of the output (think block vs. JSON/answer block).
**Why it's unexplored:** Current GRPO assigns one advantage to the entire completion.
**Literature evidence:**
- **BAE (2602.10231):** For structured generations (like `<think>...</think>JSON...`), outcome-level advantage assigns the same gradient signal to thinking tokens and answer tokens. But the thinking block's quality might differ from the answer block's quality. BAE assigns separate advantages to each block using outcome-conditioned baselines.
**Implementation:** Split completion into blocks at `</think>`. Score thinking block separately (was it concise? was it relevant?). Score answer block separately (was it correct?). Different advantages for different blocks.
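A token-level sketch of that split (the two scalar block advantages stand in for BAE's outcome-conditioned baselines, which are not reproduced here):

```python
def blockwise_advantages(tokens, think_adv, answer_adv, sep="</think>"):
    # Tokens up to and including the separator get the thinking-block
    # advantage; everything after gets the answer-block advantage.
    # If the model never closes the think block, score it all as thinking.
    cut = tokens.index(sep) + 1 if sep in tokens else len(tokens)
    return [think_adv] * cut + [answer_adv] * (len(tokens) - cut)

toks = ["<think>", "check", "fields", "</think>", "{", '"nota"', ":", "1", "}"]
print(blockwise_advantages(toks, think_adv=-0.2, answer_adv=0.8))
# [-0.2, -0.2, -0.2, -0.2, 0.8, 0.8, 0.8, 0.8, 0.8]
```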
### 7.6 🟡 EDGE-GRPO: Entropy-Driven Advantage (Directly Addresses Advantage Collapse)
**What:** Scale advantages by inverse normalized entropy β responses that are both correct AND confident get higher advantages.
**Why it's unexplored:** Requires computing per-response entropy during training.
**Literature evidence:**
- **EDGE-GRPO (2507.21848):** When the model generates near-identical responses, the advantages are near-zero (advantage collapse). EDA divides advantages by normalized entropy: `Ã_i = A_i / P̄_i`. This amplifies advantages for diverse, confident-correct responses and penalizes confident-incorrect ones.
- **Also uses Guided Error Correction (GEC):** For incorrect responses, inject the correct answer 25% of the time – this ensures each group contains positive examples. It is especially useful for hard tasks like extraction where the model might get 0/8 correct.
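One reading of the `Ã_i = A_i / P̄_i` formula as code (a sketch; normalizing entropies into (0, 1] relative to the group maximum is an assumption, not necessarily the paper's exact definition of `P̄_i`):

```python
def eda_advantages(advantages, response_entropies):
    # Divide each response's advantage by its normalized entropy:
    # confident (low-entropy) responses get amplified advantages in
    # either sign - more positive if correct, more negative if wrong.
    h_max = max(response_entropies)
    p_bar = [max(h / h_max, 1e-6) for h in response_entropies]
    return [a / p for a, p in zip(advantages, p_bar)]

# The confident-correct response is boosted; the diffuse one is unchanged:
print(eda_advantages([0.5, -0.5], [0.5, 1.0]))  # [1.0, -0.5]
```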
### 7.7 🟢 Curriculum Learning with Progressive Context Length (Skywork-OR1)
**What:** Start training with shorter max_completion_length and progressively increase it across stages.
**Why it's unexplored:** v2 used fixed 2048, v3 uses fixed 4096.
**Literature evidence:**
- **Skywork-OR1 (2505.22312) §3.2.2:** Multi-stage training (progressive context length) "significantly reduces computational costs while preserving scalability." Start with 2048 → 4096 → 8192.
- **Train Long Think Short (2508.08940):** Curriculum GRPO that progressively tightens token budgets improves accuracy AND token efficiency.
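A schedule of this shape reduces to a step-indexed lookup. The stage boundaries below are illustrative, not values from either paper:

```python
# Illustrative stage schedule: short completion budgets first, then longer.
# Step boundaries are assumptions for this sketch, not the papers' values.
STAGES = [
    (0,   2048),   # early steps: short completions, cheap exploration
    (100, 4096),   # middle stage: medium budget
    (200, 8192),   # late stage: full budget
]

def max_completion_length(step: int) -> int:
    """Return the completion budget in effect at a given training step."""
    length = STAGES[0][1]
    for start, budget in STAGES:
        if step >= start:
            length = budget
    return length
```

In practice each stage would be a separate training run (or a re-instantiated config), since `max_completion_length` is fixed at trainer construction in most GRPO setups.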
### 7.8 🟢 Prompt Augmentation to Scale Data
**What:** Generate paraphrased/augmented versions of existing prompts to increase effective dataset size.
**Literature evidence:**
- **Prompt Augmentation for GRPO (2602.03190):** Augmenting training prompts (rephrasing, adding context variations) enables longer training without entropy collapse. "Prompt augmentation scales up GRPO training."
- **Cocktail Effect (2410.01109):** Mixing 30% general reasoning data with domain data improves domain performance by 2-15%.
### 7.9 🟢 DPO as Complement or Alternative
**What:** Use the SFT model to generate completions, score them with reward functions, and create preference pairs for DPO training.
**Why it's unexplored:** Project committed to GRPO from the start.
**Literature evidence:**
- **Iterative DPO (2503.12854):** DPO is computationally efficient and can achieve comparable results to RL for some tasks. Iteratively generating new preference pairs from the current policy and training DPO is effective.
- **Tucano2 paper (2603.03543) §9:** The Tucano2 Think model itself was post-trained using APO (a DPO variant) on GigaVerbo-v2 Preferences. This means the base Think model already has DPO in its training history; adding GRPO on top creates a complex interaction.
**When to use:** DPO might be more appropriate for tasks with clear good/bad pairs (extraction: valid JSON vs. invalid JSON) where you don't need the exploration that GRPO provides.
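The pair-construction step can be sketched as follows. The `margin` threshold and helper names are hypothetical, and `reward_fn` stands in for the project's existing reward functions:

```python
def build_preference_pairs(prompt_completions, reward_fn, margin=0.2):
    """Turn reward-scored completions into (prompt, chosen, rejected) triples.

    prompt_completions: dict mapping prompt -> list of sampled completions.
    Only pairs whose reward gap exceeds `margin` (a hypothetical threshold)
    are kept, so DPO trains on unambiguous preferences.
    """
    pairs = []
    for prompt, completions in prompt_completions.items():
        scored = sorted(((reward_fn(prompt, c), c) for c in completions), reverse=True)
        best_score, best = scored[0]
        worst_score, worst = scored[-1]
        if best_score - worst_score > margin:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```

The resulting triples match the `prompt`/`chosen`/`rejected` layout that preference-tuning trainers typically expect.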
### 7.10 🟢 Separate Task-Specific LoRA Adapters
**What:** Train separate LoRA adapters for each task instead of one multi-task adapter.
**Why it's unexplored:** Current approach uses one adapter for all 4 tasks.
**Rationale:** Extraction and insights have fundamentally different optimal behaviors (terse JSON vs. verbose analysis). A single adapter must compromise. Separate adapters + routing would let each task optimize independently.
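The routing layer for this design could be as small as a lookup table. Adapter names here are hypothetical; with PEFT, the selected name would typically be passed to `model.set_adapter`:

```python
# Hypothetical task -> adapter routing table; adapter names are illustrative.
TASK_ADAPTERS = {
    "extraction": "lora-extraction",
    "sql": "lora-sql",
    "push": "lora-push",
    "insights": "lora-insights",
}

def select_adapter(task: str) -> str:
    """Pick the LoRA adapter for a task; unknown tasks fall back to a
    shared adapter so routing never fails at inference time."""
    return TASK_ADAPTERS.get(task, "lora-shared")
```

The cost is four training runs and four adapters in memory, but each task's policy can then diverge freely (terse JSON for extraction, verbose prose for insights).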
### 7.11 🟢 vLLM for Generation Speedup
**What:** Enable `USE_VLLM=True` in the training config.
**Why it's unexplored:** Available in the codebase (`USE_VLLM = False` in Cell 3) but disabled.
**Expected impact:** 10-20× generation speedup; total training time could drop from 25h to ~5-8h.
---
## 8. Literature-Backed Recommendations
### Priority Matrix
| # | Action | Expected Impact | Effort | Risk | Paper Evidence |
|---|--------|----------------|--------|------|----------------|
| 1 | **Switch to Base model** | 🔴 Transformative (extraction 0.12 → 0.50+) | Medium (re-SFT required) | Low | ThinkJSON, DeepSeek-R1-Zero, Reasoning-SQL |
| 2 | **Implement DAPO Clip-Higher** | 🔴 High (fixes clip_ratio=0) | Medium (trainer subclass) | Low | DAPO §3.1, Tricks or Traps §4.2 |
| 3 | **Add entropy bonus to loss** | 🔴 High (prevents entropy collapse) | Medium (trainer subclass) | Low | Skywork-OR1 MAGIC (Eq. 3.1) |
| 4 | **GDPO reward normalization** | 🟡 Moderate (fixes task interference) | Low (reward fn change) | Low | GDPO §3.1, MO-GRPO Theorem 1 |
| 5 | **Build formal benchmark** | 🟡 Moderate (enables measurement) | Low (1-2 days) | None | N/A |
| 6 | **Scale to 5000+ prompts** | 🟡 Moderate | Medium (data generation) | Low | Skywork-OR1, Cocktail Effect |
| 7 | **Dynamic task weighting** | 🟡 Moderate (helps extraction) | Medium | Low | MT-GRPO §3 |
| 8 | **Enable vLLM** | 🟢 Low (speed only) | Low | Low | N/A |
| 9 | **Curriculum context length** | 🟢 Low-Moderate | Low | Low | Skywork-OR1 §3.2.2 |
| 10 | **Blockwise advantages** | 🟢 Low-Moderate | High | Medium | BAE (2602.10231) |
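For reference, the Clip-Higher objective from row 2 can be sketched as a per-token surrogate in plain Python. The ε defaults follow DAPO's reported values; this is an illustration of the mechanism, not the trainer-subclass implementation:

```python
def clip_higher_surrogate(ratio: float, advantage: float,
                          eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """DAPO's decoupled clipping: the upper bound (1 + eps_high) is raised
    above the lower one (1 - eps_low), so low-probability tokens with
    positive advantage can still be pushed up -- the mechanism DAPO uses
    against entropy collapse. Returns the per-token surrogate (maximized).
    Defaults follow the paper's reported (0.2, 0.28); this is a sketch."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

With symmetric PPO clipping (ε = 0.2 on both sides), a ratio of 1.25 on a positive-advantage token would be clipped to 1.2; Clip-Higher lets it through, which is exactly the extra upward headroom the priority matrix counts on.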
### Recommended Execution Order
```
IMMEDIATE (while v3 runs):
→ Build benchmark (Phase 1 of ADR-001)
→ Prepare Base model SFT data

AFTER v3 COMPLETES:
→ Evaluate v3 vs v2 on benchmark
→ If extraction still < 0.3: Switch to Base model
→ Re-run SFT on Base model
→ GRPO v4 on Base with: DAPO clip, entropy bonus, GDPO rewards,
  dynamic task weighting, 5000+ prompts

IF v4 STILL SHOWS ENTROPY COLLAPSE:
→ Try EDGE-GRPO (GEC + EDA)
→ Try DPO as fallback for extraction/SQL specifically

DEPLOYMENT:
→ Hybrid: Base model for extraction/SQL/push, Think model for insights
→ Or: Single Base model with all tasks (likely better overall)
```
---
## 9. Risk Assessment
### Risks of Current v3 Run
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Entropy collapse persists (clip_ratio=0 after step 50) | **High** (70%) | Training produces marginal improvement | Add entropy bonus, DAPO clip in v4 |
| Think model still can't produce JSON at inference (temp=0.1) | **Very High** (90%) | Good training metrics but poor deployment | Switch to Base model |
| 25h training gets preempted on Spot VM | Medium (30%) | Lost progress | Checkpoint every 10 steps ✅ |
| reward_think_efficiency has no effect | **High** (60%) | Think overhead unchanged | L1 paper says RL reward needed; single run may not learn |
### Risks of Recommended Changes
| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Base model SFT loses Portuguese quality | Low (10%) | Need to re-CPT | Tucano2-Base already has Portuguese CPT |
| DAPO clip causes training instability | Low (15%) | NaN loss | Start with ε_high = 0.22 (conservative) |
| Data augmentation introduces noise | Medium (30%) | Reward signal degraded | Validate synthetic data quality with reward function |
---
## 10. Conclusion & Priority Roadmap
### What's Working
- The SFT → GRPO pipeline is correct and producing measurable improvements (+42%)
- The reward function engineering is solid (continuous, multi-component, calibrated)
- The infrastructure and methodology are research-grade
- Portuguese domain specialization thesis is validated
### What's Not Working
- Entropy collapse prevents real policy learning (clip_ratio=0)
- Thinking model is fundamentally incompatible with structured output under token constraints
- Data scale is 10-100× below published minimums
### The Single Highest-Impact Change
**Switch from the Think model to the Base model.** This one change attacks several problems at once:
1. Eliminates the `<think>` overhead that blocks extraction/SQL output
2. Reduces completion lengths, so training runs faster and fits more steps per hour
3. Aligns with the methodology of every canonical GRPO paper
Combined with DAPO's Clip-Higher and Skywork-OR1's entropy bonus, this should break through the v2 performance plateau.
### 90-Day Roadmap
| Week | Action | Success Metric |
|------|--------|---------------|
| 1 | Build benchmark, evaluate v3 | Benchmark ready, v3 numbers on 80 prompts |
| 2 | SFT on Base model, GRPO v4 probe | Base SFT loss < Think SFT loss; v4 probe clip_ratio > 0 |
| 3-4 | GRPO v4 full run (Base + DAPO clip + entropy bonus + GDPO) | eval reward > 0.25; extraction > 0.40 |
| 5-6 | Scale data to 5000+, GRPO v5 | eval reward > 0.35; all tasks > 0.30 |
| 7-8 | Benchmark vs Qwen3-35B-A3B and GPT-4o | Domain parity or better on structured tasks |
| 9-12 | Production deployment, monitoring, iteration | <100ms latency, <$0.002/query, >90% uptime |
---
## Appendix A: Full Paper Reference Table
| Paper | ArXiv ID | Key Finding Used | Applied? |
|-------|----------|------------------|----------|
| DeepSeek-R1 | 2501.12948 | SFT → GRPO pipeline, rule-based rewards | ✅ Yes |
| Dr. GRPO | 2503.20783 | Remove std normalization, remove length bias, β=0 | ✅ Partially (std removed, β=0 in v3) |
| Skywork-OR1 MAGIC | 2505.22312 | τ=1.0, entropy bonus, filter zero-advantage groups, multi-stage | ✅ Partially (τ=1.0 in v3, entropy monitor, noise injection) |
| MC-GRPO | 2601.22582 | Median baseline for G=4 | ❌ Not implemented |
| ThinkJSON | 2502.14905 | 1.5B Base + GRPO beats 671B on JSON extraction | ❌ Insight not acted on (still using Think) |
| Reasoning-SQL | 2503.23157 | 7B Base + GRPO beats o3-mini, staged rewards | ✅ Partially (staged rewards in v3) |
| Cocktail Effect | 2410.01109 | 30% general data improves domain performance 2-15% | ❌ Not implemented |
| DAPO | 2503.14476 | Decoupled clip (Clip-Higher), dynamic sampling, overlong filtering | ❌ Not implemented |
| GDPO | 2601.05242 | Decoupled reward normalization preserves fine-grained advantages | ❌ Not implemented |
| MO-GRPO | 2509.22047 | Variance-based reward dominance in multi-reward GRPO | ❌ Not implemented |
| MT-GRPO | 2602.05547 | Dynamic task weighting for balanced multi-task GRPO | ❌ Not implemented |
| EDGE-GRPO | 2507.21848 | Entropy-driven advantage + guided error correction | ❌ Not implemented |
| BAE | 2602.10231 | Blockwise advantages for structured output | ❌ Not implemented |
| Tricks or Traps | 2508.08221 | Local mean + global std for normalization; Clip-Higher verified | ❌ Not implemented |
| RL-Struct | 2512.00319 | Multi-dimensional reward for JSON (structure, format, validity, correctness, length) | ✅ Similar approach used |
| Prompt Augmentation | 2602.03190 | Prompt augmentation overcomes entropy collapse | ❌ Not implemented |
| Train Long Think Short | 2508.08940 | Progressive token budgets via curriculum | ❌ Not implemented |
| OptimalThinkingBench | 2508.13141 | "Don't overthink" prompts | ✅ Applied in v3 |
| L1 | 2503.04697 | Token budgets require RL training to work | ✅ Applied in v3 (reward_think_efficiency) |
| Tucano2 | 2603.03543 | Portuguese CPT on Qwen3, GigaVerbo-v2 datasets | ✅ Base model used |
## Appendix B: Repository Structure Summary
```
rtferraz/tucano2-commerce/
├── docs/
│   ├── PROJECT.md                    # Comprehensive project documentation
│   ├── ADR-001-next-steps.md         # Detailed execution plans (benchmark, comparison, v3)
│   ├── v3_thinking_control_patch.md  # Task-aware thinking control spec
│   ├── INVESTIGATION_REPORT.md       # ← THIS FILE
│   └── checkpoints/
│       └── 2026-04-23_v3-launch.md   # v3 launch checkpoint with probe results
├── notebooks/
│   └── grpo_vertex_v3.ipynb          # v3 training notebook (running on Vertex AI)
├── scripts/
│   └── md_to_ipynb.py                # Markdown → notebook converter
├── grpo_vertex_v2_ipynb.md           # v2 reference notebook with all outputs
└── .gitignore

rtferraz/commerce-model-qwen3.5-lora/
├── adapter_config.json               # LoRA r=16, α=32, Qwen3.5-9B
├── adapter_model.safetensors         # 111MB adapter weights
├── chat_template.jinja
├── processor_config.json
├── tokenizer.json
└── tokenizer_config.json

rtferraz/parameter-golf-v2/
├── ANALYSIS.md                       # Competition gap analysis
├── train_final.py                    # Full training script (SP8192+PAF+TTT+Int6)
└── train_gpt2.py                     # Earlier GPT-2 based attempt
```
---
*Report generated on 2026-04-25 by automated investigation of all project artifacts, cross-referenced with 20+ published papers.* |