Tucano2-Commerce: Comprehensive Project Investigation Report
Date: 2026-04-25
Scope: Full audit of training performance, identified issues, unexplored alternatives, and actionable recommendations
Repositories Audited:
- rtferraz/tucano2-commerce: Main project repo (docs, notebooks, scripts)
- rtferraz/commerce-model-qwen3.5-lora: Qwen3.5-9B SFT LoRA adapter
- rtferraz/parameter-golf-v2: Separate competition project (parameter-efficient LM)
Table of Contents
- Executive Summary
- Project Architecture & Timeline
- Every Change That Improved Performance
- Every Issue That Needs Improvement
- Every Invaluable Lesson Learned
- Every Good Aspect of This Model/Training
- Unexplored Alternatives: What You Haven't Tried Yet
- Literature-Backed Recommendations
- Risk Assessment
- Conclusion & Priority Roadmap
1. Executive Summary
The Tucano2-Commerce project aims to build a compact (3.7B parameter) domain-specialized LLM for Brazilian e-commerce analysis (sentiment, JSON extraction, SQL generation, churn prediction, and business insights), all in Portuguese. The pipeline follows the DeepSeek-R1 paradigm: Base → SFT → GRPO.
Key Numbers
| Metric | Value |
|---|---|
| Base model | Polygl0t/Tucano2-qwen-3.7B-Think (Qwen3-4B → Portuguese CPT → SFT+Think) |
| SFT data | ~1,650 domain-specific samples |
| GRPO v2 data | 300 prompts (subset) |
| GRPO v3 data | ~1,404 prompts (full) |
| v2 best eval reward | 0.125 (eval) / 0.54 (validation mean), +42% over SFT baseline |
| v2 training steps | 210/300 (early stopped) |
| v2 duration | 14.9 hours on NVIDIA L4 |
| v3 status | Launched (~500 steps, ~25h estimated) |
| Critical issues | Entropy collapse, completion length ceiling, thinking model overhead |
| Hardware | Single NVIDIA L4 (24GB VRAM) |
Verdict
The project demonstrates strong engineering discipline and research-driven decision-making. The +42% improvement over SFT baseline is real. However, three structural issues (entropy collapse, thinking model incompatibility with structured output, and data scale) are capping performance. The v3 run addresses these partially, but the literature points to several unexplored approaches that could yield substantially better results.
2. Project Architecture & Timeline
Model Lineage
Qwen/Qwen3-4B-Base
└── Polygl0t/Tucano2-qwen-3.7B-Base (Portuguese continual pretraining, 320B-token corpus)
    └── Polygl0t/Tucano2-qwen-3.7B-Think (SFT + thinking training, GigaVerbo-v2)
        └── YOUR SFT adapter (domain e-commerce, ~1,650 samples)
            ├── GRPO v1 (first attempt, killed: zero-signal bug)
            ├── GRPO v2 (210 steps, +42% over SFT)
            └── GRPO v3 (launched, all fixes from ADR-001)
Separate Project: Qwen3.5-9B LoRA
Qwen/Qwen3.5-9B
└── rtferraz/commerce-model-qwen3.5-lora (LoRA: r=16, α=32, 111MB adapter)
This is a separate SFT experiment on a larger model (9B). The adapter config shows standard LoRA targeting all linear layers (q,k,v,o,gate,up,down projections) with r=16, α=32, no dropout. No training metrics or README details were saved; only the default Unsloth template.
Separate Project: Parameter Golf v2
A competition entry for parameter-efficient language modeling (BPB metric). Uses Int6 GPTQ quantization, SP8192 tokenizer, parallel residual architecture, depth recurrence, Muon optimizer, and TTT (test-time training). Sophisticated work that shows strong systems engineering capability.
3. Every Change That Improved Performance
3.1 Binary → Continuous Reward Functions ✅ (+50% training signal)
| Before | After | Evidence |
|---|---|---|
| Binary rewards (0/1) | Continuous rewards (0.0-1.0) with partial credit | reward_std=0 dropped from 50% to ~10% of steps |
Why it worked: Binary rewards create groups where all completions get 0 or all get 1. With GRPO's group-relative normalization, zero variance → zero advantage → zero gradient. Continuous rewards ensure reward variance exists in nearly every group.
Paper backing: Dr. GRPO (2503.20783) §3.1 proves that std-based normalization amplifies this problem: groups with low std get inflated gradient contributions.
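To make the contrast concrete, here is a minimal, hypothetical sketch of a binary vs. partial-credit extraction reward; the function names and scoring weights are illustrative, not the project's actual reward code.

```python
import json

def reward_extraction_binary(completion: str, expected: dict) -> float:
    """All-or-nothing: whole groups often score identically -> zero variance -> zero gradient."""
    try:
        return 1.0 if json.loads(completion) == expected else 0.0
    except (json.JSONDecodeError, TypeError):
        return 0.0

def reward_extraction_partial(completion: str, expected: dict) -> float:
    """Continuous 0.0-1.0 reward: valid JSON earns a floor, each correct field adds credit."""
    try:
        parsed = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(parsed, dict) or not expected:
        return 0.1                                    # parseable but unusable
    field_acc = sum(parsed.get(k) == v for k, v in expected.items()) / len(expected)
    return 0.2 + 0.8 * field_acc                      # 0.2 for valid JSON, up to 0.8 for field accuracy
```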
3.2 Temperature 0.1 → 0.8 ✅ (Training went from non-functional to functional)
| Before | After | Evidence |
|---|---|---|
| temp=0.1 (Qwen3 default in generation_config.json) | temp=0.8 | frac_reward_zero_std went from 1.0 (every step) to ~0.0 |
Why it worked: Low temperature makes all G=8 rollouts near-identical → zero reward variance → zero advantage → zero gradient. This was the single most destructive bug: the entire v1 run produced zero learning signal.
Paper backing: Skywork-OR1 (2505.22312) §3.1: τ=1.0 "enhances exploration capability and improves learning plasticity." Their ablation (§3.2.4) shows τ=0.6 immediately enters a low-entropy state.
3.3 scale_rewards=False (Dr. GRPO fix) ✅
| Before | After | Evidence |
|---|---|---|
| Default GRPO std normalization | Removed std normalization | More stable training; eliminated most zero-gradient steps |
Why it worked: Standard GRPO divides advantages by std(rewards) per group. When a group has near-uniform rewards, the small std inflates gradients → training instability + bias toward "easy" prompts (the difficulty bias).
Paper backing: Dr. GRPO (2503.20783) §3.1 formally proves this bias and shows removing it achieves SOTA 43.3% on AIME 2024 with 7B.
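Both of these fixes (Sections 3.2 and 3.3) are plain config options in TRL. A minimal sketch, assuming a TRL version that exposes both (the report pins 0.24.0); values taken from this section, all other arguments omitted:

```python
from trl import GRPOConfig

config = GRPOConfig(
    temperature=0.8,       # override Qwen3's generation_config.json default of 0.1 (Section 3.2)
    scale_rewards=False,   # Dr. GRPO: skip per-group std division when computing advantages (Section 3.3)
    num_generations=8,     # G rollouts per prompt; reward variance across these drives the advantage
)
```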
3.4 EVAL_MAX_TOKENS 256 → 2048 ✅ (Prevented premature early stopping)
| Before | After | Evidence |
|---|---|---|
| 256 eval tokens | 2048 eval tokens | Training ran to 210 steps vs. killed at step 40 |
Why it worked: The Think model needs 500-700+ tokens just for </think>. At 256 tokens, eval always scored incomplete generations → flat eval metrics → early stopping fired after 3 evals → killed training at step 40.
3.5 Early Stopping Patience 3 → 10 ✅
| Before | After | Evidence |
|---|---|---|
| 3 consecutive evals | 10 consecutive evals | 100 steps of runway before halt (was 30) |
Why it worked: GRPO training is noisy; reward doesn't monotonically improve. Patience=3 was too aggressive.
3.6 UnslothGRPOTrainer Wrapper ✅ (~2-3× generation speedup)
Wraps _generate() with for_inference()/for_training() to activate Unsloth's optimized Triton kernels during generation. Without this: ~3-4 tok/s. With: ~8-15 tok/s on L4.
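A sketch of what such a wrapper can look like; the hook name _generate comes from the description above, and the exact signature in the project's trainer subclass may differ:

```python
from trl import GRPOTrainer
from unsloth import FastLanguageModel

class UnslothGRPOTrainer(GRPOTrainer):
    """Toggle Unsloth's fast-inference kernels around rollout generation, then restore training mode."""

    def _generate(self, *args, **kwargs):
        FastLanguageModel.for_inference(self.model)      # activate optimized inference path for rollouts
        try:
            return super()._generate(*args, **kwargs)
        finally:
            FastLanguageModel.for_training(self.model)   # back to training mode before the policy update
```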
3.7 processing_class=tokenizer Fix ✅
In TRL 0.24.0, passing tokenizer=tokenizer to GRPOTrainer was silently dropped. Changed to processing_class=tokenizer. Without this fix, the eval callback received None as tokenizer.
3.8 Reward Normalization (extraction reward capped to 1.0) ✅
The extraction reward function originally scored up to 2.0 while others maxed at 1.0 → extraction gradients were 2× larger → biased optimization toward extraction at the expense of other tasks.
Paper backing: MO-GRPO (2509.22047) Theorem 1 proves that GRPO advantages are more correlated with higher-variance reward components. Unnormalized rewards cause exactly this.
3.9 v3 Changes (Launched, Awaiting Results)
| Change | From | To | Paper |
|---|---|---|---|
| Temperature | 0.8 | 1.0 | Skywork-OR1 |
| max_completion_length | 2048 | 4096 | Dr. GRPO |
| num_generations | 8 | 4 | MC-GRPO (VRAM tradeoff) |
| learning_rate | 5e-7 | 2e-6 | Dr. GRPO Appendix G |
| β (KL penalty) | implicit | 0.0 | Dr. GRPO §3.2 |
| Training data | 300 subset | ~1,400 (all) | Scale fix |
| System prompts | generic | 4 task-aware | OptimalThinkingBench |
| Think efficiency reward | none | reward_think_efficiency() | L1 paper |
| Zero-advantage groups | included | noise injection (σ=0.005) | Skywork-OR1 |
| grad_accum | 2 | 1 | Effective batch 4 |
4. Every Issue That Needs Improvement
4.1 🔴 CRITICAL: Entropy Collapse (clip_ratio=0 on ALL steps)
Evidence: v2 logs show clip_ratio=0 on every single training step. KL divergence = 0.004. The policy barely moved from the SFT initialization.
What this means: The PPO clipping mechanism is designed to prevent the policy from moving too far. But clip_ratio=0 means the policy never even approached the clipping boundary; it's not that clipping is preventing movement, it's that the policy has no gradient signal large enough to push it.
Root cause analysis (with paper evidence):
DAPO (2503.14476) §3.1, Clip-Higher: Standard PPO clips at [1-ε, 1+ε]. For low-probability "exploration" tokens (p=0.01), the upper bound is only 0.012, so the token can barely increase its probability. Meanwhile, high-probability "exploitation" tokens (p=0.9) can go to 1.08. This asymmetry means the upper clip restricts exploration far more than exploitation. DAPO proposes a decoupled clip with ε_low=0.2, ε_high=0.28: a wider upper clip to encourage exploration.
Skywork-OR1 (2505.22312) §4: On-policy training significantly slows entropy collapse. Off-policy updates (multiple gradient steps per rollout) accelerate it. Your current setup does 1 gradient step per rollout (on-policy); this is correct but insufficient without the entropy bonus.
EDGE-GRPO (2507.21848): Even with temperature=1.0, models can still collapse to near-deterministic output. The paper proposes Entropy-Driven Advantage (EDA), dividing advantages by normalized per-response entropy, which amplifies the advantage of diverse responses.
How to fix:
- Add explicit entropy bonus to loss (Skywork-OR1 MAGIC loss, Eq. 3.1): α_k * H_ij^t(θ), where α starts at 5e-3 and decays. This requires modifying the loss function.
- Implement DAPO's Clip-Higher: set ε_low=0.2, ε_high=0.28 (or even higher). This is a TRL config change if supported, or requires a trainer subclass.
- Filter zero-advantage groups completely (Skywork-OR1 §3.1), not just add noise: remove entire prompts where all G completions get identical rewards (a sketch follows below).
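For the group-filtering fix, a minimal sketch assuming rewards arrive as a flat per-completion tensor ordered by prompt group; where the resulting mask is applied depends on the trainer internals:

```python
import torch

def keep_nonconstant_groups(rewards: torch.Tensor, num_generations: int) -> torch.Tensor:
    """Boolean mask over completions that drops prompts whose G rewards are (near-)identical."""
    grouped = rewards.view(-1, num_generations)       # (num_prompts, G)
    keep = grouped.std(dim=1) > 1e-6                  # variance present -> non-zero advantages possible
    return keep.repeat_interleave(num_generations)    # back to a per-completion mask
```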
4.2 🔴 CRITICAL: Thinking Model Incompatibility with Structured Output
Evidence:
- v2 calibration: 8/8 samples hit the 2048 ceiling
- v3 calibration (temp=0.7): 8/8 samples hit the 4096 ceiling, both extraction samples stuck in <think>
- Prompt-level control ("Não pense em excesso", i.e. "don't overthink") had zero measurable effect at inference time
- L1 paper (2503.04697) confirms: untrained models ignore length instructions
Root cause: The Think model's chat template always injects <think> on the last assistant turn; there is no enable_thinking conditional (unlike official Qwen3-4B). The model was trained to think extensively, and this behavior is deeply embedded in its weights.
Why this matters for extraction/SQL:
- Extraction needs ~50-100 tokens of output (JSON). The model produces 2000-3000 tokens of <think> first.
- At temp=0.1 (inference), the model deterministically fills the entire context with thinking.
- At temp=1.0 (training), completions are shorter (358-528 tokens avg), but this creates a train-test distribution mismatch.
How to fix:
- Switch to the Base model, Polygl0t/Tucano2-qwen-3.7B-Base. Every canonical GRPO paper starts from base/instruct, not thinking models. DeepSeek-R1-Zero proved thinking emerges from RL. ThinkJSON (2502.14905) beats R1-671B on JSON extraction using Qwen2.5-1.5B-Base + GRPO. This requires re-running SFT (LoRA adapters are model-specific).
- Hybrid deployment: use the Think model for insights (where thinking adds value) and the Base model for extraction/SQL/push (where thinking hurts).
- Modify the chat template: fork the template to conditionally disable <think> injection for extraction/push tasks. This is a workaround, not a fix.
4.3 🟡 MODERATE: Data Scale (300 → 1,400, Still Below Literature Minimum)
Evidence:
- v2: 300 prompts → early stopping at step 210 (70% of one epoch)
- v3: ~1,400 prompts β 500 steps planned
- Literature minimum: Skywork-OR1 uses 30K+ prompts. DeepSeek-R1 uses 600K+.
Why this matters: With 1,400 prompts, the model sees each prompt only once. There's no second-epoch reinforcement. The reward signal is thin: each task type has only 100-650 examples.
How to fix:
- Synthetic data augmentation using GPT-4o or the SFT model itself (planned in ADR-001)
- Data mixing with general reasoning: the Cocktail Effect paper (2410.01109) shows 30% general data improves domain performance by 2-15%
- Target 5,000+ prompts for meaningful multi-epoch training
4.4 🟡 MODERATE: Multi-Task Reward Interference
Evidence:
- Bimodal performance: insights/analysis (0.50-0.70) vs. extraction (0.12)
- Extraction reward was previously 2× the scale of other rewards (fixed in v2)
- v3 uses single composite reward summing all components
Root cause (paper evidence):
MO-GRPO (2509.22047) Theorem 1: In standard GRPO, the advantage function is more correlated with reward components that have higher variance. If reward_insights has variance 0.1 but reward_extraction has variance 0.01 (because extraction either works or doesn't), GRPO will preferentially optimize for insights.

GDPO (2601.05242) §3.1: When GRPO sums multiple rewards before normalization, distinct reward combinations can map to identical advantages, losing information. E.g., (format=0, content=1) and (format=1, content=0) both sum to 1 and receive the same advantage, despite being completely different errors.

Multi-Task GRPO (2602.05547) §3: Standard average reward maximization can allow large gains on easy tasks to compensate for stagnation on hard tasks. Their formulation explicitly bounds inter-task performance disparity.
How to fix:
- GDPO: Normalize each reward component separately before summing. This preserves fine-grained advantage distinctions.
- Multi-Task GRPO: Dynamic task weighting that upweights underperforming tasks (extraction) and downweights saturating tasks (insights).
- Conditional rewards: Gate easier rewards (format) on harder ones (content accuracy). The model only receives the format reward if content is above a threshold (GDPO §3.2, Eq. 8); a minimal gating sketch follows below.
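A minimal sketch of the gating idea; the threshold and composition are illustrative, not GDPO's exact Eq. 8:

```python
def gated_reward(content_score: float, format_score: float, threshold: float = 0.5) -> float:
    """Only pay out the easy (format) reward once the hard (content) reward clears a threshold."""
    if content_score < threshold:
        return content_score        # no credit for nicely formatted but wrong content
    return content_score + format_score
```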
4.5 🟡 MODERATE: No Formal Benchmark
Evidence: Evaluation uses 5 held-out prompts scored by the reward function itself. There's no independent benchmark with ground truth, no comparison against baselines (Qwen3-3.7B base, GPT-4o), no standardized metrics.
How to fix: Phase 1 of ADR-001 is well-designed (80 prompts, per-task scorers, multiple baselines). Execute it.
4.6 🟢 MINOR: TRL 0.24.0 Lock
Evidence: Pinned to TRL 0.24.0 for Unsloth compatibility. Newer TRL versions have:
- Native entropy_coeff in GRPOConfig
- Better logging (clip ratios per positive/negative)
- Bug fixes for generation config handling
How to fix: Either upgrade Unsloth or implement needed features via callbacks/trainer subclass (v3 already does this for entropy monitoring).
4.7 🟢 MINOR: Single GPU Training Bottleneck
Evidence:
- Smoke test: 318s/step → 13.2h for 75 steps
- v2 full run: ~4.3 min/step → 14.9h for 210 steps
- v3 estimated: ~3 min/step → 25h for 500 steps
With G=4 and max_completion_length=4096, generation dominates training time. vLLM was available but not used (USE_VLLM=False).
How to fix:
- Enable vLLM colocate mode for faster generation
- Consider multi-GPU setup (2×L4 or A100) for generation parallelism
5. Every Invaluable Lesson Learned
5.1 Technical Lessons
Default model generation configs will silently destroy your RL training. Qwen3's generation_config.json sets temperature=0.1. This single default was responsible for the complete failure of v1. Always explicitly override every generation parameter.

The reward function is the product specification. Binary rewards → zero signal. Continuous rewards with partial credit → training works. Multi-component rewards with staged convergence → format learns first, content follows. The time spent designing rewards is the most valuable engineering time.
GRPO needs diversity to learn: diversity in completions AND diversity in prompts. Low temperature → identical completions → zero advantage. Few prompts → memorization → entropy collapse. Short completion budget → truncation → reward ceiling. All three destroy the algorithm's fundamental mechanism: comparing different outcomes to the same prompt.
TRL's step calculation includes a num_generations multiplier: steps = num_prompts × num_generations / (batch_size × grad_accum). Missing this gives wrong epoch estimates. MAX_STEPS always overrides NUM_EPOCHS.
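As a quick sanity check of that formula, the numbers below reproduce the v2 run's 300-step epoch; the per-device batch size and grad_accum are assumed here for illustration:

```python
# Illustrative: 300 prompts x G=8 generations, assumed per-device batch 4 and grad_accum 2.
num_prompts, num_generations = 300, 8
per_device_batch, grad_accum = 4, 2
steps_per_epoch = num_prompts * num_generations // (per_device_batch * grad_accum)
print(steps_per_epoch)  # 300 -> the v2 early stop at step 210 was ~70% of one epoch
```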
Early stopping parameters must match the model's output characteristics. A thinking model needs 500+ tokens for </think>. Evaluating at 256 tokens scores incomplete generations → flat metrics → premature stop.

Entropy collapse is the GRPO failure mode, not divergence, not reward hacking. The model collapses to deterministic output. Monitoring clip_ratio and generation entropy is more important than monitoring reward.

Calibration at inference temperature ≠ training behavior. Calibrating at temp=0.7 showed catastrophic results (100% ceiling hits). But actual training at temp=1.0 showed healthy dynamics (358-528 token avg, 0% ceiling). Future calibration must include a temp=1.0 pass.
LoRA adapters are model-specific. Can't transfer adapters from Think → Base. Switching base model requires re-running SFT from scratch.
Thinking models and structured output tasks are fundamentally in tension when completion budgets are constrained. The <think> block consumes tokens that the task output needs.

KV cache correctness matters. The diagnostic cell (5b) correctly identified that the KV cache was working (ratio 0.7×). Had it been broken (>5×), generation would have been catastrophically slow.
5.2 Process Lessons
Budget 3-5 iterations, not 1. v1 found the zero-signal bug. v2 found the temperature bug and completion ceiling. v3 addresses entropy collapse. Each iteration is cheaper because you know what to measure.
Literature crawl before implementation saves compute. The research found 6 papers on thinking control, Dr. GRPO's bias fixes, Skywork-OR1's entropy analysis, and the entire GRPO variant ecosystem β all directly applicable. Without this, you'd discover these issues empirically at $2/GPU-hour.
The model family tree matters. Discovering the lineage Tucano2-Think ← Tucano2-Base ← Qwen3-4B-Base gave a clean non-thinking alternative with Portuguese preserved.
Documentation is debugging. The project has excellent documentation (PROJECT.md, ADR-001, checkpoint logs, v3 patch spec). This made the entire investigation possible. Without docs, understanding 14.9 hours of training would require reading raw W&B logs.
5.3 Business Lessons
Domain data is the moat, not model size. ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. The 42% improvement from domain GRPO on 300 examples validates this thesis.
Self-hosting economics are immediately favorable. $0.001/analysis (GPU) vs $0.01+ (API). Breakeven at ~100 analyses/day.
Portuguese-first is a defensible advantage. Most LLM development is English-first. A model that understands Brazilian e-commerce Portuguese ("veio com defeito" / "it came defective", "nota 1 estrela" / "1-star rating") has a competitive moat.
6. Every Good Aspect of This Model/Training
6.1 Architecture Decisions
✅ Correct pipeline choice (SFT → GRPO). The DeepSeek-R1 paradigm is validated by multiple papers and is the right approach for rule-based reward domains.
✅ Correct base model selection. Qwen3-4B with Portuguese continual pretraining (Tucano2) is arguably the best available foundation for this task size. The Tucano2 paper (2603.03543) shows it achieves SOTA on Portuguese benchmarks. Using a Portuguese-specialized model instead of vanilla Qwen3 is the right call.
✅ Rule-based rewards over a neural reward model. For structured tasks with verifiable outputs (JSON schema, SQL execution), rule-based rewards are objectively superior. DeepSeek-R1 demonstrated this. Neural reward models at this scale would introduce reward hacking.
✅ 4-bit quantization (NF4) via Unsloth. Enables a 3.7B model to fit in 24GB VRAM with headroom. The VRAM budget analysis (Cell 9 smoke test) confirmed a 6.8GB/23.6GB peak: massive headroom.
✅ LoRA over full fine-tuning for SFT. With only 1,650 training samples, full fine-tuning would overfit. LoRA (r=16, α=32, 33M/3.8B trainable params = 0.87%) is appropriate.
6.2 Engineering Practices
✅ Gated cell execution (Cells 1-13). Each cell is a verification gate: verify output before proceeding. This prevents cascading failures.
✅ Comprehensive diagnostic cells. KV cache test (5b), inference test (5), reward calibration (7), smoke test (9), probe run (10), all before committing to the full run. This is excellent practice.
✅ Weight drift validation (Cell 11 safety checks). Testing 50 merge/unmerge cycles for LoRA weight drift, memory leak detection, and gradient flow verification. No other project I've audited does this.
✅ UNSLOTH_COMPILE_DISABLE=1. Prevents Triton kernel recompilation on every for_inference()/for_training() switch. This shows understanding of Unsloth internals.
✅ Proper checkpoint management. save_steps=10-15, save_total_limit=3-5, save_only_model=True: efficient disk usage with enough coverage for Spot VM preemption recovery.
✅ Multi-task reward design. Separate reward functions for extraction, SQL, insights, and push notifications, each with domain-specific heuristics. The extraction reward scores 10 individual JSON fields with appropriate validators.
6.3 Research Methodology
✅ Every decision is paper-backed. Dr. GRPO for std normalization. Skywork-OR1 for temperature. MC-GRPO for group size. ThinkJSON for the domain specialization thesis. This is research-grade engineering.
✅ Proactive issue diagnosis. The project identified entropy collapse, completion ceiling, and data scale as root causes, not just symptoms. The analysis correctly attributes clip_ratio=0 to entropy collapse (not insufficient learning rate or a wrong reward function).
✅ Clear documentation with decision log. PROJECT.md has a formal decision log with context, problem, decision, consequence, and reference for every choice. This is ADR (Architecture Decision Record) quality.
6.4 Training Results
✅ +42% over SFT baseline is significant. Going from 0.38 (SFT calibration) to 0.54 (GRPO v2 validation mean) demonstrates that GRPO is providing real value, even with all the issues.
✅ Bimodal performance reveals the problem structure. The fact that insights/analysis (0.50-0.70) work well while extraction (0.12) doesn't tells you exactly where to focus: structured output + thinking model = the bottleneck.
✅ Zero frac_reward_zero_std after the v2 fixes. The reward engineering is correct: every group now has reward variance. The remaining issue is that advantages are too small to overcome the clip boundary.
7. Unexplored Alternatives: What You Haven't Tried Yet
7.1 🔴 Base Model GRPO (Highest Expected Impact)
What: Train GRPO starting from Polygl0t/Tucano2-qwen-3.7B-Base instead of -Think.
Why it's unexplored: The project committed to the Think model early and hasn't tested the Base alternative.
Literature evidence:
- DeepSeek-R1-Zero: Proved that thinking/reasoning emerges from RL training on base models; you don't need a pre-trained thinker.
- ThinkJSON (2502.14905): Qwen2.5-1.5B-Base + GRPO beats DeepSeek-R1-671B on JSON extraction. Base model = no <think> overhead = more tokens for actual output.
- Reasoning-SQL (2503.23157): 7B base model + GRPO beats o3-mini on SQL.
- Your own analysis (checkpoint log): "Every canonical GRPO paper starts from base/instruct, not thinking models."
Expected impact:
- Extraction score: 0.12 → 0.50+ (eliminating the <think> overhead means the JSON fits in the completion budget)
- Completion efficiency: 3000 → 200-500 tokens for extraction
- Training speed: ~2× faster (shorter completions)
Cost: Requires re-running SFT (2-4 hours on L4), then GRPO (25 hours).
7.2 🔴 DAPO's Decoupled Clip (Directly Addresses Entropy Collapse)
What: Replace the symmetric clip [1-ε, 1+ε] with an asymmetric [1-ε_low, 1+ε_high] where ε_high > ε_low.
Why it's unexplored: Not available in TRL 0.24.0 as a config option. Requires trainer subclass modification.
Literature evidence:
- DAPO (2503.14476) §3.1: Standard symmetric clipping restricts low-probability exploration tokens far more than high-probability exploitation tokens. A token with p=0.01 can only reach 0.012, while p=0.9 can reach 1.08. Decoupled clip with ε_low=0.2, ε_high=0.28 specifically allows exploration tokens to increase more.
- Tricks or Traps (2508.08221) §4.2: Independently verifies that Clip-Higher is one of the most impactful single techniques for preventing entropy collapse. Their "Lite PPO" achieves strong results with just the normalization fix + Clip-Higher.
- Your symptom matches exactly: clip_ratio=0 means no tokens are being clipped in either direction. The upper clip is preventing exploration before the policy even reaches it.
Expected impact: Non-zero clip_ratio → actual policy movement → real learning signal.
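A sketch of the per-token surrogate with the decoupled clip range; as noted above, wiring it into training requires a TRL trainer subclass:

```python
import torch

def dapo_clipped_term(ratio: torch.Tensor, advantage: torch.Tensor,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """PPO-style surrogate with an asymmetric clip range [1 - eps_low, 1 + eps_high].

    The wider upper bound lets low-probability (exploration) tokens raise their
    probability further before the clip stops the gradient.
    """
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.minimum(ratio * advantage, clipped * advantage)
```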
7.3 🟡 GDPO for Multi-Task Rewards (Fixes Reward Interference)
What: Normalize each reward component separately before summing, then apply batch-wise normalization.
Why it's unexplored: Your current approach sums all reward components into a single scalar before GRPO's group normalization.
Literature evidence:
- GDPO (2601.05242) §3.1: When summing K rewards before normalization, distinct reward combinations collapse to identical advantages. With 4 tasks × 4 reward components, you're losing substantial gradient information.
- MO-GRPO (2509.22047): Proves (Theorem 1) that advantage correlation with each reward component is proportional to that component's standard deviation. Higher-variance rewards dominate, regardless of importance.
Implementation: For each prompt group, normalize each of the 4 task-specific rewards independently, then sum the normalized advantages, then apply batch-level normalization.
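A sketch of that recipe for a single prompt group; GDPO's exact formulation (including the batch-level step) may differ:

```python
import numpy as np

def gdpo_group_advantages(component_rewards: dict[str, np.ndarray]) -> np.ndarray:
    """Normalize each reward component within the group, then sum the normalized advantages."""
    any_component = next(iter(component_rewards.values()))
    advantages = np.zeros_like(any_component, dtype=float)
    for rewards in component_rewards.values():
        std = rewards.std()
        if std > 1e-6:                                # a constant component contributes no advantage
            advantages += (rewards - rewards.mean()) / std
    return advantages                                 # batch-level normalization would follow (GDPO's 2nd step)
```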
7.4 🟡 Multi-Task GRPO Dynamic Weighting (Fixes Task Imbalance)
What: Dynamically upweight underperforming tasks (extraction) during training.
Why it's unexplored: Current approach uses fixed stratified sampling (40% extraction, 40% SQL, 10% insights, 10% push) but equal reward weighting.
Literature evidence:
- MT-GRPO (2602.05547): Proposes improvement-aware weight update (IWU) that tracks per-task reward improvement rates and upweights tasks that are stagnating. Avoids the collapse-to-worst-task problem of naive minimax.
- Key insight: Use true task-level rewards for weight updates, not GRPO loss (which is ambiguous: zero loss could mean all-correct or all-incorrect).
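A rough sketch of the improvement-aware idea, not MT-GRPO's actual IWU rule: tasks whose recent reward improvement is smallest receive proportionally more weight.

```python
def update_task_weights(improvement_rates: dict[str, float], floor: float = 1e-3) -> dict[str, float]:
    """Map per-task reward improvement rates to weights that favor stagnating tasks (weights sum to 1)."""
    raw = {task: 1.0 / (floor + max(rate, 0.0)) for task, rate in improvement_rates.items()}
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}

# e.g. extraction stagnating while insights improves:
# update_task_weights({"extraction": 0.00, "insights": 0.05, "sql": 0.02, "push": 0.01})
```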
7.5 🟡 Blockwise Advantage Estimation (For Structured Multi-Part Output)
What: Assign separate advantages to different parts of the output (think block vs. JSON/answer block).
Why it's unexplored: Current GRPO assigns one advantage to the entire completion.
Literature evidence:
- BAE (2602.10231): For structured generations (like <think>...</think> followed by JSON), outcome-level advantage assigns the same gradient signal to thinking tokens and answer tokens. But the thinking block's quality might differ from the answer block's quality. BAE assigns separate advantages to each block using outcome-conditioned baselines.
Implementation: Split completion into blocks at </think>. Score thinking block separately (was it concise? was it relevant?). Score answer block separately (was it correct?). Different advantages for different blocks.
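A minimal sketch of the split step; the block-specific scorers and advantage assignment would sit on top of it:

```python
def split_completion(completion: str) -> tuple[str, str]:
    """Split a completion into (thinking_block, answer_block) at the closing </think> tag."""
    think, sep, answer = completion.partition("</think>")
    if not sep:                       # no closing tag: the whole completion is "thinking", no answer block
        return completion, ""
    return think, answer.strip()
```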
7.6 🟡 EDGE-GRPO: Entropy-Driven Advantage (Directly Addresses Advantage Collapse)
What: Scale advantages by inverse normalized entropy, so responses that are both correct AND confident get higher advantages.
Why it's unexplored: Requires computing per-response entropy during training.
Literature evidence:
- EDGE-GRPO (2507.21848): When the model generates near-identical responses, the advantages are near-zero (advantage collapse). EDA divides advantages by normalized entropy: Â_i = A_i / P̂_i, where P̂_i is the normalized per-response entropy. This amplifies advantages for diverse, confident-correct responses and penalizes confident-incorrect ones.
- Also uses Guided Error Correction (GEC): For incorrect responses, inject the correct answer 25% of the time, ensuring each group contains positive examples. This is especially useful for hard tasks like extraction where the model might get 0/8 correct.
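Read literally, that formula scales each response's advantage by the inverse of its group-normalized entropy. A sketch of that reading (the paper's exact normalization may differ):

```python
import torch

def entropy_scaled_advantages(advantages: torch.Tensor, response_entropies: torch.Tensor) -> torch.Tensor:
    """Â_i = A_i / P̂_i, with P̂_i the mean per-token entropy of response i normalized within the group."""
    normalized = response_entropies / (response_entropies.sum() + 1e-8)
    return advantages / (normalized + 1e-8)
```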
7.7 🟢 Curriculum Learning with Progressive Context Length (Skywork-OR1)
What: Start training with shorter max_completion_length and progressively increase it across stages.
Why it's unexplored: v2 used fixed 2048, v3 uses fixed 4096.
Literature evidence:
- Skywork-OR1 (2505.22312) §3.2.2: Multi-stage training (progressive context length) "significantly reduces computational costs while preserving scalability." Start with 2048 → 4096 → 8192.
- Train Long Think Short (2508.08940): Curriculum GRPO that progressively tightens token budgets improves accuracy AND token efficiency.
7.8 🟢 Prompt Augmentation to Scale Data
What: Generate paraphrased/augmented versions of existing prompts to increase effective dataset size.
Literature evidence:
- Prompt Augmentation for GRPO (2602.03190): Augmenting training prompts (rephrasing, adding context variations) enables longer training without entropy collapse. "Prompt augmentation scales up GRPO training."
- Cocktail Effect (2410.01109): Mixing 30% general reasoning data with domain data improves domain performance by 2-15%.
7.9 🟢 DPO as Complement or Alternative
What: Use the SFT model to generate completions, score them with reward functions, and create preference pairs for DPO training.
Why it's unexplored: Project committed to GRPO from the start.
Literature evidence:
- Iterative DPO (2503.12854): DPO is computationally efficient and can achieve comparable results to RL for some tasks. Iteratively generating new preference pairs from the current policy and training DPO is effective.
- Tucano2 paper (2603.03543) §9: The Tucano2 Think model itself was post-trained using APO (a DPO variant) on GigaVerbo-v2 Preferences. This means the base Think model already has DPO in its training history; adding GRPO on top creates a complex interaction.
When to use: DPO might be more appropriate for tasks with clear good/bad pairs (extraction: valid JSON vs. invalid JSON) where you don't need the exploration that GRPO provides.
7.10 🟢 Separate Task-Specific LoRA Adapters
What: Train separate LoRA adapters for each task instead of one multi-task adapter.
Why it's unexplored: Current approach uses one adapter for all 4 tasks.
Rationale: Extraction and insights have fundamentally different optimal behaviors (terse JSON vs. verbose analysis). A single adapter must compromise. Separate adapters + routing would let each task optimize independently.
7.11 🟢 vLLM for Generation Speedup
What: Enable USE_VLLM=True in the training config.
Why it's unexplored: Available in the codebase (USE_VLLM = False in Cell 3) but disabled.
Expected impact: 10-20× generation speedup; total training time could drop from 25h to ~5-8h.
8. Literature-Backed Recommendations
Priority Matrix
| # | Action | Expected Impact | Effort | Risk | Paper Evidence |
|---|---|---|---|---|---|
| 1 | Switch to Base model | 🔴 Transformative (extraction 0.12 → 0.50+) | Medium (re-SFT required) | Low | ThinkJSON, DeepSeek-R1-Zero, Reasoning-SQL |
| 2 | Implement DAPO Clip-Higher | 🔴 High (fixes clip_ratio=0) | Medium (trainer subclass) | Low | DAPO §3.1, Tricks or Traps §4.2 |
| 3 | Add entropy bonus to loss | 🔴 High (prevents entropy collapse) | Medium (trainer subclass) | Low | Skywork-OR1 MAGIC (Eq. 3.1) |
| 4 | GDPO reward normalization | 🟡 Moderate (fixes task interference) | Low (reward fn change) | Low | GDPO §3.1, MO-GRPO Theorem 1 |
| 5 | Build formal benchmark | 🟡 Moderate (enables measurement) | Low (1-2 days) | None | - |
| 6 | Scale to 5000+ prompts | 🟡 Moderate | Medium (data generation) | Low | Skywork-OR1, Cocktail Effect |
| 7 | Dynamic task weighting | 🟡 Moderate (helps extraction) | Medium | Low | MT-GRPO §3 |
| 8 | Enable vLLM | 🟢 Low (speed only) | Low | Low | - |
| 9 | Curriculum context length | 🟢 Low-Moderate | Low | Low | Skywork-OR1 §3.2.2 |
| 10 | Blockwise advantages | 🟢 Low-Moderate | High | Medium | BAE (2602.10231) |
Recommended Execution Order
IMMEDIATE (while v3 runs):
→ Build benchmark (Phase 1 of ADR-001)
→ Prepare Base model SFT data

AFTER v3 COMPLETES:
→ Evaluate v3 vs v2 on benchmark
→ If extraction still < 0.3: Switch to Base model
→ Re-run SFT on Base model
→ GRPO v4 on Base with: DAPO clip, entropy bonus, GDPO rewards,
  dynamic task weighting, 5000+ prompts

IF v4 STILL SHOWS ENTROPY COLLAPSE:
→ Try EDGE-GRPO (GEC + EDA)
→ Try DPO as fallback for extraction/SQL specifically

DEPLOYMENT:
→ Hybrid: Base model for extraction/SQL/push, Think model for insights
→ Or: Single Base model with all tasks (likely better overall)
9. Risk Assessment
Risks of Current v3 Run
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Entropy collapse persists (clip_ratio=0 after step 50) | High (70%) | Training produces marginal improvement | Add entropy bonus, DAPO clip in v4 |
| Think model still can't produce JSON at inference (temp=0.1) | Very High (90%) | Good training metrics but poor deployment | Switch to Base model |
| 25h training gets preempted on Spot VM | Medium (30%) | Lost progress | Checkpoint every 10 steps ✅ |
| reward_think_efficiency has no effect | High (60%) | Think overhead unchanged | L1 paper says RL reward needed; single run may not learn |
Risks of Recommended Changes
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Base model SFT loses Portuguese quality | Low (10%) | Need to re-CPT | Tucano2-Base already has Portuguese CPT |
| DAPO clip causes training instability | Low (15%) | NaN loss | Start with ε_high=0.22 (conservative) |
| Data augmentation introduces noise | Medium (30%) | Reward signal degraded | Validate synthetic data quality with reward function |
10. Conclusion & Priority Roadmap
What's Working
- The SFT → GRPO pipeline is correct and producing measurable improvements (+42%)
- The reward function engineering is solid (continuous, multi-component, calibrated)
- The infrastructure and methodology are research-grade
- Portuguese domain specialization thesis is validated
What's Not Working
- Entropy collapse prevents real policy learning (clip_ratio=0)
- Thinking model is fundamentally incompatible with structured output under token constraints
- Data scale is 10-100× below published minimums
The Single Highest-Impact Change
Switch from Think model to Base model. This one change addresses two of three critical issues simultaneously:
- Eliminates the <think> overhead that blocks extraction/SQL output
- Reduces completion lengths → faster training → more steps per hour
- Aligns with every canonical GRPO paper's methodology
Combined with DAPO's Clip-Higher and Skywork-OR1's entropy bonus, this should break through the v2 performance plateau.
90-Day Roadmap
| Week | Action | Success Metric |
|---|---|---|
| 1 | Build benchmark, evaluate v3 | Benchmark ready, v3 numbers on 80 prompts |
| 2 | SFT on Base model, GRPO v4 probe | Base SFT loss < Think SFT loss; v4 probe clip_ratio > 0 |
| 3-4 | GRPO v4 full run (Base + DAPO clip + entropy bonus + GDPO) | eval reward > 0.25; extraction > 0.40 |
| 5-6 | Scale data to 5000+, GRPO v5 | eval reward > 0.35; all tasks > 0.30 |
| 7-8 | Benchmark vs Qwen3-35B-A3B and GPT-4o | Domain parity or better on structured tasks |
| 9-12 | Production deployment, monitoring, iteration | <100ms latency, <$0.002/query, >90% uptime |
Appendix A: Full Paper Reference Table
| Paper | ArXiv ID | Key Finding Used | Applied? |
|---|---|---|---|
| DeepSeek-R1 | 2501.12948 | SFT → GRPO pipeline, rule-based rewards | ✅ Yes |
| Dr. GRPO | 2503.20783 | Remove std normalization, remove length bias, β=0 | ⚠️ Partially (std removed, β=0 in v3) |
| Skywork-OR1 MAGIC | 2505.22312 | τ=1.0, entropy bonus, filter zero-advantage groups, multi-stage | ⚠️ Partially (τ=1.0 in v3, entropy monitor, noise injection) |
| MC-GRPO | 2601.22582 | Median baseline for G=4 | ❌ Not implemented |
| ThinkJSON | 2502.14905 | 1.5B Base + GRPO beats 671B on JSON extraction | ❌ Insight not acted on (still using Think) |
| Reasoning-SQL | 2503.23157 | 7B Base + GRPO beats o3-mini, staged rewards | ⚠️ Partially (staged rewards in v3) |
| Cocktail Effect | 2410.01109 | 30% general data improves domain performance 2-15% | ❌ Not implemented |
| DAPO | 2503.14476 | Decoupled clip (Clip-Higher), dynamic sampling, overlong filtering | ❌ Not implemented |
| GDPO | 2601.05242 | Decoupled reward normalization preserves fine-grained advantages | ❌ Not implemented |
| MO-GRPO | 2509.22047 | Variance-based reward dominance in multi-reward GRPO | ❌ Not implemented |
| MT-GRPO | 2602.05547 | Dynamic task weighting for balanced multi-task GRPO | ❌ Not implemented |
| EDGE-GRPO | 2507.21848 | Entropy-driven advantage + guided error correction | ❌ Not implemented |
| BAE | 2602.10231 | Blockwise advantages for structured output | ❌ Not implemented |
| Tricks or Traps | 2508.08221 | Local mean + global std for normalization; Clip-Higher verified | ❌ Not implemented |
| RL-Struct | 2512.00319 | Multi-dimensional reward for JSON (structure, format, validity, correctness, length) | ✅ Similar approach used |
| Prompt Augmentation | 2602.03190 | Prompt augmentation overcomes entropy collapse | ❌ Not implemented |
| Train Long Think Short | 2508.08940 | Progressive token budgets via curriculum | ❌ Not implemented |
| OptimalThinkingBench | 2508.13141 | "Don't overthink" prompts | ✅ Applied in v3 |
| L1 | 2503.04697 | Token budgets require RL training to work | ✅ Applied in v3 (reward_think_efficiency) |
| Tucano2 | 2603.03543 | Portuguese CPT on Qwen3, GigaVerbo-v2 datasets | ✅ Base model used |
Appendix B: Repository Structure Summary
rtferraz/tucano2-commerce/
├── docs/
│   ├── PROJECT.md                      # Comprehensive project documentation
│   ├── ADR-001-next-steps.md           # Detailed execution plans (benchmark, comparison, v3)
│   ├── v3_thinking_control_patch.md    # Task-aware thinking control spec
│   ├── INVESTIGATION_REPORT.md         # ← THIS FILE
│   └── checkpoints/
│       └── 2026-04-23_v3-launch.md     # v3 launch checkpoint with probe results
├── notebooks/
│   └── grpo_vertex_v3.ipynb            # v3 training notebook (running on Vertex AI)
├── scripts/
│   └── md_to_ipynb.py                  # Markdown → notebook converter
├── grpo_vertex_v2_ipynb.md             # v2 reference notebook with all outputs
└── .gitignore

rtferraz/commerce-model-qwen3.5-lora/
├── adapter_config.json                 # LoRA r=16, α=32, Qwen3.5-9B
├── adapter_model.safetensors           # 111MB adapter weights
├── chat_template.jinja
├── processor_config.json
├── tokenizer.json
└── tokenizer_config.json

rtferraz/parameter-golf-v2/
├── ANALYSIS.md                         # Competition gap analysis
├── train_final.py                      # Full training script (SP8192+PAF+TTT+Int6)
└── train_gpt2.py                       # Earlier GPT-2 based attempt
Report generated on 2026-04-25 by automated investigation of all project artifacts, cross-referenced with 20+ published papers.