Tucano2-Commerce: Comprehensive Project Investigation Report
Date: 2026-04-25
Scope: Full audit of training performance, identified issues, unexplored alternatives, and actionable recommendations
Repositories Audited:
- rtferraz/tucano2-commerce: Main project repo (docs, notebooks, scripts)
- rtferraz/commerce-model-qwen3.5-lora: Qwen3.5-9B SFT LoRA adapter
- rtferraz/parameter-golf-v2: Separate competition project (parameter-efficient LM)
Table of Contents
- Executive Summary
- Project Architecture & Timeline
- Every Change That Improved Performance
- Every Issue That Needs Improvement
- Every Invaluable Lesson Learned
- Every Good Aspect of This Model/Training
- Unexplored Alternatives: What You Haven't Tried Yet
- Literature-Backed Recommendations
- Risk Assessment
- Conclusion & Priority Roadmap
1. Executive Summary
The Tucano2-Commerce project aims to build a compact (3.7B parameter) domain-specialized LLM for Brazilian e-commerce analysis (sentiment, JSON extraction, SQL generation, churn prediction, and business insights), all in Portuguese. The pipeline follows the DeepSeek-R1 paradigm: Base → SFT → GRPO.
Key Numbers
| Metric | Value |
|---|---|
| Base model | Polygl0t/Tucano2-qwen-3.7B-Think (Qwen3-4B → Portuguese CPT → SFT+Think) |
| SFT data | ~1,650 domain-specific samples |
| GRPO v2 data | 300 prompts (subset) |
| GRPO v3 data | ~1,404 prompts (full) |
| v2 best eval reward | 0.125 (eval) / 0.54 (validation mean), +42% over SFT baseline |
| v2 training steps | 210/300 (early stopped) |
| v2 duration | 14.9 hours on NVIDIA L4 |
| v3 status | Launched (~500 steps, ~25h estimated) |
| Critical issues | Entropy collapse, completion length ceiling, thinking model overhead |
| Hardware | Single NVIDIA L4 (24GB VRAM) |
Verdict
The project demonstrates strong engineering discipline and research-driven decision-making. The +42% improvement over SFT baseline is real. However, three structural issues (entropy collapse, thinking model incompatibility with structured output, and data scale) are capping performance. The v3 run addresses these partially, but the literature points to several unexplored approaches that could yield substantially better results.
2. Project Architecture & Timeline
Model Lineage
Qwen/Qwen3-4B-Base
└── Polygl0t/Tucano2-qwen-3.7B-Base (Portuguese continual pretraining, 320B-token corpus)
    └── Polygl0t/Tucano2-qwen-3.7B-Think (SFT + thinking training, GigaVerbo-v2)
        └── YOUR SFT adapter (domain e-commerce, ~1,650 samples)
            ├── GRPO v1 (first attempt, killed: zero-signal bug)
            ├── GRPO v2 (210 steps, +42% over SFT)
            └── GRPO v3 (launched, all fixes from ADR-001)
Separate Project: Qwen3.5-9B LoRA
Qwen/Qwen3.5-9B
└── rtferraz/commerce-model-qwen3.5-lora (LoRA: r=16, α=32, 111MB adapter)
This is a separate SFT experiment on a larger model (9B). The adapter config shows standard LoRA targeting all linear layers (q,k,v,o,gate,up,down projections) with r=16, α=32, no dropout. No training metrics or README details were saved; only the default Unsloth template.
Separate Project: Parameter Golf v2
A competition entry for parameter-efficient language modeling (BPB metric). Uses Int6 GPTQ quantization, SP8192 tokenizer, parallel residual architecture, depth recurrence, Muon optimizer, and TTT (test-time training). Sophisticated work that shows strong systems engineering capability.
3. Every Change That Improved Performance
3.1 Binary → Continuous Reward Functions ✅ (+50% training signal)
| Before | After | Evidence |
|---|---|---|
| Binary rewards (0/1) | Continuous rewards (0.0-1.0) with partial credit | reward_std=0 dropped from 50% to ~10% of steps |
Why it worked: Binary rewards create groups where all completions get 0 or all get 1. With GRPO's group-relative normalization, zero variance → zero advantage → zero gradient. Continuous rewards ensure reward variance exists in nearly every group.
Paper backing: Dr. GRPO (2503.20783) §3.1 proves that std-based normalization amplifies this problem: groups with low std get inflated gradient contributions.
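To make the contrast concrete, here is a minimal, hypothetical sketch of a binary vs. partial-credit extraction reward; the function names and scoring weights are illustrative, not the project's actual reward code.

```python
import json

def reward_extraction_binary(completion: str, expected: dict) -> float:
    """All-or-nothing: whole groups often score identically -> zero variance -> zero gradient."""
    try:
        return 1.0 if json.loads(completion) == expected else 0.0
    except (json.JSONDecodeError, TypeError):
        return 0.0

def reward_extraction_partial(completion: str, expected: dict) -> float:
    """Continuous 0.0-1.0 reward: valid JSON earns a floor, each correct field adds credit."""
    try:
        parsed = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if not isinstance(parsed, dict) or not expected:
        return 0.1                                    # parseable but unusable
    field_acc = sum(parsed.get(k) == v for k, v in expected.items()) / len(expected)
    return 0.2 + 0.8 * field_acc                      # 0.2 for valid JSON, up to 0.8 for field accuracy
```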
3.2 Temperature 0.1 → 0.8 ✅ (Training went from non-functional to functional)
| Before | After | Evidence |
|---|---|---|
| temp=0.1 (Qwen3 default in generation_config.json) | temp=0.8 | frac_reward_zero_std went from 1.0 (every step) to ~0.0 |
Why it worked: Low temperature makes all G=8 rollouts near-identical → zero reward variance → zero advantage → zero gradient. This was the single most destructive bug: the entire v1 run produced zero learning signal.
Paper backing: Skywork-OR1 (2505.22312) §3.1: τ=1.0 "enhances exploration capability and improves learning plasticity." Their ablation (§3.2.4) shows τ=0.6 immediately enters a low-entropy state.
3.3 scale_rewards=False (Dr. GRPO fix) ✅
| Before | After | Evidence |
|---|---|---|
| Default GRPO std normalization | Removed std normalization | More stable training; eliminated most zero-gradient steps |
Why it worked: Standard GRPO divides advantages by std(rewards) per group. When a group has near-uniform rewards, the small std inflates gradients → training instability + bias toward "easy" prompts (the difficulty bias).
Paper backing: Dr. GRPO (2503.20783) §3.1 formally proves this bias and shows removing it achieves SOTA 43.3% on AIME 2024 with 7B.
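Both of these fixes (Sections 3.2 and 3.3) are plain config options in TRL. A minimal sketch, assuming a TRL version that exposes both (the report pins 0.24.0); values taken from this section, all other arguments omitted:

```python
from trl import GRPOConfig

config = GRPOConfig(
    temperature=0.8,       # override Qwen3's generation_config.json default of 0.1 (Section 3.2)
    scale_rewards=False,   # Dr. GRPO: skip per-group std division when computing advantages (Section 3.3)
    num_generations=8,     # G rollouts per prompt; reward variance across these drives the advantage
)
```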
3.4 EVAL_MAX_TOKENS 256 → 2048 ✅ (Prevented premature early stopping)
| Before | After | Evidence |
|---|---|---|
| 256 eval tokens | 2048 eval tokens | Training ran to 210 steps vs. killed at step 40 |
Why it worked: The Think model needs 500-700+ tokens just for </think>. At 256 tokens, eval always scored incomplete generations → flat eval metrics → early stopping fired after 3 evals → killed training at step 40.
3.5 Early Stopping Patience 3 → 10 ✅
| Before | After | Evidence |
|---|---|---|
| 3 consecutive evals | 10 consecutive evals | 100 steps of runway before halt (was 30) |
Why it worked: GRPO training is noisy; reward doesn't monotonically improve. Patience=3 was too aggressive.
3.6 UnslothGRPOTrainer Wrapper ✅ (~2-3× generation speedup)
Wraps _generate() with for_inference()/for_training() to activate Unsloth's optimized Triton kernels during generation. Without this: ~3-4 tok/s. With: ~8-15 tok/s on L4.
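A sketch of what such a wrapper can look like; the hook name _generate comes from the description above, and the exact signature in the project's trainer subclass may differ:

```python
from trl import GRPOTrainer
from unsloth import FastLanguageModel

class UnslothGRPOTrainer(GRPOTrainer):
    """Toggle Unsloth's fast-inference kernels around rollout generation, then restore training mode."""

    def _generate(self, *args, **kwargs):
        FastLanguageModel.for_inference(self.model)      # activate optimized inference path for rollouts
        try:
            return super()._generate(*args, **kwargs)
        finally:
            FastLanguageModel.for_training(self.model)   # back to training mode before the policy update
```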
3.7 processing_class=tokenizer Fix ✅
In TRL 0.24.0, passing tokenizer=tokenizer to GRPOTrainer was silently dropped. Changed to processing_class=tokenizer. Without this fix, the eval callback received None as tokenizer.
3.8 Reward Normalization (extraction reward capped to 1.0) ✅
The extraction reward function originally scored up to 2.0 while others maxed at 1.0 → extraction gradients were 2× larger → biased optimization toward extraction at the expense of other tasks.
Paper backing: MO-GRPO (2509.22047) Theorem 1 proves that GRPO advantages are more correlated with higher-variance reward components. Unnormalized rewards cause exactly this.
3.9 v3 Changes (Launched, Awaiting Results)
| Change | From | To | Paper |
|---|---|---|---|
| Temperature | 0.8 | 1.0 | Skywork-OR1 |
| max_completion_length | 2048 | 4096 | Dr. GRPO |
| num_generations | 8 | 4 | MC-GRPO (VRAM tradeoff) |
| learning_rate | 5e-7 | 2e-6 | Dr. GRPO Appendix G |
| β (KL penalty) | implicit | 0.0 | Dr. GRPO §3.2 |
| Training data | 300 subset | ~1,400 (all) | Scale fix |
| System prompts | generic | 4 task-aware | OptimalThinkingBench |
| Think efficiency reward | none | reward_think_efficiency() | L1 paper |
| Zero-advantage groups | included | noise injection (σ=0.005) | Skywork-OR1 |
| grad_accum | 2 | 1 | Effective batch 4 |
4. Every Issue That Needs Improvement
4.1 🔴 CRITICAL: Entropy Collapse (clip_ratio=0 on ALL steps)
Evidence: v2 logs show clip_ratio=0 on every single training step. KL divergence = 0.004. The policy barely moved from the SFT initialization.
What this means: The PPO clipping mechanism is designed to prevent the policy from moving too far. But clip_ratio=0 means the policy never even approached the clipping boundary; it's not that clipping is preventing movement, it's that the policy has no gradient signal large enough to push it.
Root cause analysis (with paper evidence):
DAPO (2503.14476) §3.1, Clip-Higher: Standard PPO clips at [1-ε, 1+ε]. For low-probability "exploration" tokens (p=0.01), the upper bound is only 0.012, so the token can barely increase its probability. Meanwhile, high-probability "exploitation" tokens (p=0.9) can go to 1.08. This asymmetry means the upper clip restricts exploration far more than exploitation. DAPO proposes a decoupled clip with ε_low=0.2, ε_high=0.28: a wider upper clip to encourage exploration.
Skywork-OR1 (2505.22312) §4: On-policy training significantly slows entropy collapse. Off-policy updates (multiple gradient steps per rollout) accelerate it. Your current setup does 1 gradient step per rollout (on-policy); this is correct but insufficient without the entropy bonus.
EDGE-GRPO (2507.21848): Even with temperature=1.0, models can still collapse to near-deterministic output. The paper proposes Entropy-Driven Advantage (EDA), dividing advantages by normalized per-response entropy, which amplifies the advantage of diverse responses.
How to fix:
- Add explicit entropy bonus to loss (Skywork-OR1 MAGIC loss, Eq. 3.1): α_k * H_ij^t(θ), where α starts at 5e-3 and decays. This requires modifying the loss function.
- Implement DAPO's Clip-Higher: set ε_low=0.2, ε_high=0.28 (or even higher). This is a TRL config change if supported, or requires a trainer subclass.
- Filter zero-advantage groups completely (Skywork-OR1 §3.1), not just add noise: remove entire prompts where all G completions get identical rewards (a sketch follows below).
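For the group-filtering fix, a minimal sketch assuming rewards arrive as a flat per-completion tensor ordered by prompt group; where the resulting mask is applied depends on the trainer internals:

```python
import torch

def keep_nonconstant_groups(rewards: torch.Tensor, num_generations: int) -> torch.Tensor:
    """Boolean mask over completions that drops prompts whose G rewards are (near-)identical."""
    grouped = rewards.view(-1, num_generations)       # (num_prompts, G)
    keep = grouped.std(dim=1) > 1e-6                  # variance present -> non-zero advantages possible
    return keep.repeat_interleave(num_generations)    # back to a per-completion mask
```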
4.2 🔴 CRITICAL: Thinking Model Incompatibility with Structured Output
Evidence:
- v2 calibration: 8/8 samples hit the 2048 ceiling
- v3 calibration (temp=0.7): 8/8 samples hit the 4096 ceiling, both extraction samples stuck in <think>
- Prompt-level control ("Não pense em excesso", i.e. "don't overthink") had zero measurable effect at inference time
- L1 paper (2503.04697) confirms: untrained models ignore length instructions
Root cause: The Think model's chat template always injects <think> on the last assistant turn; there is no enable_thinking conditional (unlike official Qwen3-4B). The model was trained to think extensively, and this behavior is deeply embedded in its weights.
Why this matters for extraction/SQL:
- Extraction needs ~50-100 tokens of output (JSON). The model produces 2000-3000 tokens of <think> first.
- At temp=0.1 (inference), the model deterministically fills the entire context with thinking.
- At temp=1.0 (training), completions are shorter (358-528 tokens avg), but this creates a train-test distribution mismatch.
How to fix:
- Switch to the Base model, Polygl0t/Tucano2-qwen-3.7B-Base. Every canonical GRPO paper starts from base/instruct, not thinking models. DeepSeek-R1-Zero proved thinking emerges from RL. ThinkJSON (2502.14905) beats R1-671B on JSON extraction using Qwen2.5-1.5B-Base + GRPO. This requires re-running SFT (LoRA adapters are model-specific).
- Hybrid deployment: use the Think model for insights (where thinking adds value) and the Base model for extraction/SQL/push (where thinking hurts).
- Modify the chat template: fork the template to conditionally disable <think> injection for extraction/push tasks. This is a workaround, not a fix.
4.3 🟡 MODERATE: Data Scale (300 → 1,400, Still Below Literature Minimum)
Evidence:
- v2: 300 prompts → early stopping at step 210 (70% of one epoch)
- v3: ~1,400 prompts β 500 steps planned
- Literature minimum: Skywork-OR1 uses 30K+ prompts. DeepSeek-R1 uses 600K+.
Why this matters: With 1,400 prompts, the model sees each prompt only once. There's no second-epoch reinforcement. The reward signal is thin: each task type has only 100-650 examples.
How to fix:
- Synthetic data augmentation using GPT-4o or the SFT model itself (planned in ADR-001)
- Data mixing with general reasoning: the Cocktail Effect paper (2410.01109) shows 30% general data improves domain performance by 2-15%
- Target 5,000+ prompts for meaningful multi-epoch training
4.4 🟡 MODERATE: Multi-Task Reward Interference
Evidence:
- Bimodal performance: insights/analysis (0.50-0.70) vs. extraction (0.12)
- Extraction reward was previously 2× the scale of other rewards (fixed in v2)
- v3 uses single composite reward summing all components
Root cause (paper evidence):
MO-GRPO (2509.22047) Theorem 1: In standard GRPO, the advantage function is more correlated with reward components that have higher variance. If reward_insights has variance 0.1 but reward_extraction has variance 0.01 (because extraction either works or doesn't), GRPO will preferentially optimize for insights.

GDPO (2601.05242) §3.1: When GRPO sums multiple rewards before normalization, distinct reward combinations can map to identical advantages, losing information. E.g., (format=0, content=1) and (format=1, content=0) both sum to 1 and receive the same advantage, despite being completely different errors.

Multi-Task GRPO (2602.05547) §3: Standard average reward maximization can allow large gains on easy tasks to compensate for stagnation on hard tasks. Their formulation explicitly bounds inter-task performance disparity.
How to fix:
- GDPO: Normalize each reward component separately before summing. This preserves fine-grained advantage distinctions.
- Multi-Task GRPO: Dynamic task weighting that upweights underperforming tasks (extraction) and downweights saturating tasks (insights).
- Conditional rewards: Gate easier rewards (format) on harder ones (content accuracy). The model only receives the format reward if content is above a threshold (GDPO §3.2, Eq. 8); a minimal gating sketch follows below.
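A minimal sketch of the gating idea; the threshold and composition are illustrative, not GDPO's exact Eq. 8:

```python
def gated_reward(content_score: float, format_score: float, threshold: float = 0.5) -> float:
    """Only pay out the easy (format) reward once the hard (content) reward clears a threshold."""
    if content_score < threshold:
        return content_score        # no credit for nicely formatted but wrong content
    return content_score + format_score
```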
4.5 🟡 MODERATE: No Formal Benchmark
Evidence: Evaluation uses 5 held-out prompts scored by the reward function itself. There's no independent benchmark with ground truth, no comparison against baselines (Qwen3-3.7B base, GPT-4o), no standardized metrics.
How to fix: Phase 1 of ADR-001 is well-designed (80 prompts, per-task scorers, multiple baselines). Execute it.
4.6 🟢 MINOR: TRL 0.24.0 Lock
Evidence: Pinned to TRL 0.24.0 for Unsloth compatibility. Newer TRL versions have:
- Native entropy_coeff in GRPOConfig
- Better logging (clip ratios per positive/negative)
- Bug fixes for generation config handling
How to fix: Either upgrade Unsloth or implement needed features via callbacks/trainer subclass (v3 already does this for entropy monitoring).
4.7 🟢 MINOR: Single GPU Training Bottleneck
Evidence:
- Smoke test: 318s/step → 13.2h for 75 steps
- v2 full run: ~4.3 min/step → 14.9h for 210 steps
- v3 estimated: ~3 min/step → 25h for 500 steps
With G=4 and max_completion_length=4096, generation dominates training time. vLLM was available but not used (USE_VLLM=False).
How to fix:
- Enable vLLM colocate mode for faster generation
- Consider multi-GPU setup (2×L4 or A100) for generation parallelism
5. Every Invaluable Lesson Learned
5.1 Technical Lessons
Default model generation configs will silently destroy your RL training. Qwen3's generation_config.json sets temperature=0.1. This single default was responsible for the complete failure of v1. Always explicitly override every generation parameter.

The reward function is the product specification. Binary rewards → zero signal. Continuous rewards with partial credit → training works. Multi-component rewards with staged convergence → format learns first, content follows. The time spent designing rewards is the most valuable engineering time.
GRPO needs diversity to learn: diversity in completions AND diversity in prompts. Low temperature → identical completions → zero advantage. Few prompts → memorization → entropy collapse. Short completion budget → truncation → reward ceiling. All three destroy the algorithm's fundamental mechanism: comparing different outcomes to the same prompt.
TRL's step calculation includes a num_generations multiplier: steps = num_prompts × num_generations / (batch_size × grad_accum). Missing this gives wrong epoch estimates. MAX_STEPS always overrides NUM_EPOCHS.
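As a quick sanity check of that formula, the numbers below reproduce the v2 run's 300-step epoch; the per-device batch size and grad_accum are assumed here for illustration:

```python
# Illustrative: 300 prompts x G=8 generations, assumed per-device batch 4 and grad_accum 2.
num_prompts, num_generations = 300, 8
per_device_batch, grad_accum = 4, 2
steps_per_epoch = num_prompts * num_generations // (per_device_batch * grad_accum)
print(steps_per_epoch)  # 300 -> the v2 early stop at step 210 was ~70% of one epoch
```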
Early stopping parameters must match the model's output characteristics. A thinking model needs 500+ tokens for </think>. Evaluating at 256 tokens scores incomplete generations → flat metrics → premature stop.

Entropy collapse is the GRPO failure mode, not divergence, not reward hacking. The model collapses to deterministic output. Monitoring clip_ratio and generation entropy is more important than monitoring reward.

Calibration at inference temperature ≠ training behavior. Calibrating at temp=0.7 showed catastrophic results (100% ceiling hits). But actual training at temp=1.0 showed healthy dynamics (358-528 token avg, 0% ceiling). Future calibration must include a temp=1.0 pass.
LoRA adapters are model-specific. Can't transfer adapters from Think → Base. Switching base model requires re-running SFT from scratch.
Thinking models and structured output tasks are fundamentally in tension when completion budgets are constrained. The <think> block consumes tokens that the task output needs.

KV cache correctness matters. The diagnostic cell (5b) correctly identified that the KV cache was working (ratio 0.7×). Had it been broken (>5×), generation would have been catastrophically slow.
5.2 Process Lessons
Budget 3-5 iterations, not 1. v1 found the zero-signal bug. v2 found the temperature bug and completion ceiling. v3 addresses entropy collapse. Each iteration is cheaper because you know what to measure.
Literature crawl before implementation saves compute. The research found 6 papers on thinking control, Dr. GRPO's bias fixes, Skywork-OR1's entropy analysis, and the entire GRPO variant ecosystem β all directly applicable. Without this, you'd discover these issues empirically at $2/GPU-hour.
The model family tree matters. Discovering the lineage Tucano2-Think ← Tucano2-Base ← Qwen3-4B-Base gave a clean non-thinking alternative with Portuguese preserved.
Documentation is debugging. The project has excellent documentation (PROJECT.md, ADR-001, checkpoint logs, v3 patch spec). This made the entire investigation possible. Without docs, understanding 14.9 hours of training would require reading raw W&B logs.
5.3 Business Lessons
Domain data is the moat, not model size. ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. The 42% improvement from domain GRPO on 300 examples validates this thesis.
Self-hosting economics are immediately favorable. $0.001/analysis (GPU) vs $0.01+ (API). Breakeven at ~100 analyses/day.
Portuguese-first is a defensible advantage. Most LLM development is English-first. A model that understands Brazilian e-commerce Portuguese ("veio com defeito" / "it came defective", "nota 1 estrela" / "1-star rating") has a competitive moat.
6. Every Good Aspect of This Model/Training
6.1 Architecture Decisions
✅ Correct pipeline choice (SFT → GRPO). The DeepSeek-R1 paradigm is validated by multiple papers and is the right approach for rule-based reward domains.
✅ Correct base model selection. Qwen3-4B with Portuguese continual pretraining (Tucano2) is arguably the best available foundation for this task size. The Tucano2 paper (2603.03543) shows it achieves SOTA on Portuguese benchmarks. Using a Portuguese-specialized model instead of vanilla Qwen3 is the right call.
✅ Rule-based rewards over a neural reward model. For structured tasks with verifiable outputs (JSON schema, SQL execution), rule-based rewards are objectively superior. DeepSeek-R1 demonstrated this. Neural reward models at this scale would introduce reward hacking.
✅ 4-bit quantization (NF4) via Unsloth. Enables a 3.7B model to fit in 24GB VRAM with headroom. The VRAM budget analysis (Cell 9 smoke test) confirmed a 6.8GB/23.6GB peak: massive headroom.
✅ LoRA over full fine-tuning for SFT. With only 1,650 training samples, full fine-tuning would overfit. LoRA (r=16, α=32, 33M/3.8B trainable params = 0.87%) is appropriate.
6.2 Engineering Practices
✅ Gated cell execution (Cells 1-13). Each cell is a verification gate: verify output before proceeding. This prevents cascading failures.
✅ Comprehensive diagnostic cells. KV cache test (5b), inference test (5), reward calibration (7), smoke test (9), probe run (10), all before committing to the full run. This is excellent practice.
✅ Weight drift validation (Cell 11 safety checks). Testing 50 merge/unmerge cycles for LoRA weight drift, memory leak detection, and gradient flow verification. No other project I've audited does this.
✅ UNSLOTH_COMPILE_DISABLE=1. Prevents Triton kernel recompilation on every for_inference()/for_training() switch. This shows understanding of Unsloth internals.
✅ Proper checkpoint management. save_steps=10-15, save_total_limit=3-5, save_only_model=True: efficient disk usage with enough coverage for Spot VM preemption recovery.
✅ Multi-task reward design. Separate reward functions for extraction, SQL, insights, and push notifications, each with domain-specific heuristics. The extraction reward scores 10 individual JSON fields with appropriate validators.
6.3 Research Methodology
✅ Every decision is paper-backed. Dr. GRPO for std normalization. Skywork-OR1 for temperature. MC-GRPO for group size. ThinkJSON for the domain specialization thesis. This is research-grade engineering.
✅ Proactive issue diagnosis. The project identified entropy collapse, completion ceiling, and data scale as root causes, not just symptoms. The analysis correctly attributes clip_ratio=0 to entropy collapse (not insufficient learning rate or a wrong reward function).
✅ Clear documentation with decision log. PROJECT.md has a formal decision log with context, problem, decision, consequence, and reference for every choice. This is ADR (Architecture Decision Record) quality.
6.4 Training Results
✅ +42% over SFT baseline is significant. Going from 0.38 (SFT calibration) to 0.54 (GRPO v2 validation mean) demonstrates that GRPO is providing real value, even with all the issues.
✅ Bimodal performance reveals the problem structure. The fact that insights/analysis (0.50-0.70) work well while extraction (0.12) doesn't tells you exactly where to focus: structured output + thinking model = the bottleneck.
✅ Zero frac_reward_zero_std after the v2 fixes. The reward engineering is correct: every group now has reward variance. The remaining issue is that advantages are too small to overcome the clip boundary.
7. Unexplored Alternatives: What You Haven't Tried Yet
7.1 🔴 Base Model GRPO (Highest Expected Impact)
What: Train GRPO starting from Polygl0t/Tucano2-qwen-3.7B-Base instead of -Think.
Why it's unexplored: The project committed to the Think model early and hasn't tested the Base alternative.
Literature evidence:
- DeepSeek-R1-Zero: Proved that thinking/reasoning emerges from RL training on base models; you don't need a pre-trained thinker.
- ThinkJSON (2502.14905): Qwen2.5-1.5B-Base + GRPO beats DeepSeek-R1-671B on JSON extraction. Base model = no <think> overhead = more tokens for actual output.
- Reasoning-SQL (2503.23157): 7B base model + GRPO beats o3-mini on SQL.
- Your own analysis (checkpoint log): "Every canonical GRPO paper starts from base/instruct, not thinking models."
Expected impact:
- Extraction score: 0.12 → 0.50+ (eliminating the <think> overhead means the JSON fits in the completion budget)
- Completion efficiency: 3000 → 200-500 tokens for extraction
- Training speed: ~2× faster (shorter completions)
Cost: Requires re-running SFT (2-4 hours on L4), then GRPO (25 hours).
7.2 🔴 DAPO's Decoupled Clip (Directly Addresses Entropy Collapse)
What: Replace the symmetric clip [1-ε, 1+ε] with an asymmetric [1-ε_low, 1+ε_high] where ε_high > ε_low.
Why it's unexplored: Not available in TRL 0.24.0 as a config option. Requires trainer subclass modification.
Literature evidence:
- DAPO (2503.14476) §3.1: Standard symmetric clipping restricts low-probability exploration tokens far more than high-probability exploitation tokens. A token with p=0.01 can only reach 0.012, while p=0.9 can reach 1.08. Decoupled clip with ε_low=0.2, ε_high=0.28 specifically allows exploration tokens to increase more.
- Tricks or Traps (2508.08221) §4.2: Independently verifies that Clip-Higher is one of the most impactful single techniques for preventing entropy collapse. Their "Lite PPO" achieves strong results with just the normalization fix + Clip-Higher.
- Your symptom matches exactly: clip_ratio=0 means no tokens are being clipped in either direction. The upper clip is preventing exploration before the policy even reaches it.
Expected impact: Non-zero clip_ratio → actual policy movement → real learning signal.
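A sketch of the per-token surrogate with the decoupled clip range; as noted above, wiring it into training requires a TRL trainer subclass:

```python
import torch

def dapo_clipped_term(ratio: torch.Tensor, advantage: torch.Tensor,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """PPO-style surrogate with an asymmetric clip range [1 - eps_low, 1 + eps_high].

    The wider upper bound lets low-probability (exploration) tokens raise their
    probability further before the clip stops the gradient.
    """
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.minimum(ratio * advantage, clipped * advantage)
```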
7.3 🟡 GDPO for Multi-Task Rewards (Fixes Reward Interference)
What: Normalize each reward component separately before summing, then apply batch-wise normalization.
Why it's unexplored: Your current approach sums all reward components into a single scalar before GRPO's group normalization.
Literature evidence:
- GDPO (2601.05242) §3.1: When summing K rewards before normalization, distinct reward combinations collapse to identical advantages. With 4 tasks × 4 reward components, you're losing substantial gradient information.
- MO-GRPO (2509.22047): Proves (Theorem 1) that advantage correlation with each reward component is proportional to that component's standard deviation. Higher-variance rewards dominate, regardless of importance.
Implementation: For each prompt group, normalize each of the 4 task-specific rewards independently, then sum the normalized advantages, then apply batch-level normalization.
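A sketch of that recipe for a single prompt group; GDPO's exact formulation (including the batch-level step) may differ:

```python
import numpy as np

def gdpo_group_advantages(component_rewards: dict[str, np.ndarray]) -> np.ndarray:
    """Normalize each reward component within the group, then sum the normalized advantages."""
    any_component = next(iter(component_rewards.values()))
    advantages = np.zeros_like(any_component, dtype=float)
    for rewards in component_rewards.values():
        std = rewards.std()
        if std > 1e-6:                                # a constant component contributes no advantage
            advantages += (rewards - rewards.mean()) / std
    return advantages                                 # batch-level normalization would follow (GDPO's 2nd step)
```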
7.4 🟡 Multi-Task GRPO Dynamic Weighting (Fixes Task Imbalance)
What: Dynamically upweight underperforming tasks (extraction) during training.
Why it's unexplored: Current approach uses fixed stratified sampling (40% extraction, 40% SQL, 10% insights, 10% push) but equal reward weighting.
Literature evidence:
- MT-GRPO (2602.05547): Proposes improvement-aware weight update (IWU) that tracks per-task reward improvement rates and upweights tasks that are stagnating. Avoids the collapse-to-worst-task problem of naive minimax.
- Key insight: Use true task-level rewards for weight updates, not GRPO loss (which is ambiguous: zero loss could mean all-correct or all-incorrect).
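A rough sketch of the improvement-aware idea, not MT-GRPO's actual IWU rule: tasks whose recent reward improvement is smallest receive proportionally more weight.

```python
def update_task_weights(improvement_rates: dict[str, float], floor: float = 1e-3) -> dict[str, float]:
    """Map per-task reward improvement rates to weights that favor stagnating tasks (weights sum to 1)."""
    raw = {task: 1.0 / (floor + max(rate, 0.0)) for task, rate in improvement_rates.items()}
    total = sum(raw.values())
    return {task: weight / total for task, weight in raw.items()}

# e.g. extraction stagnating while insights improves:
# update_task_weights({"extraction": 0.00, "insights": 0.05, "sql": 0.02, "push": 0.01})
```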
7.5 🟡 Blockwise Advantage Estimation (For Structured Multi-Part Output)
What: Assign separate advantages to different parts of the output (think block vs. JSON/answer block).
Why it's unexplored: Current GRPO assigns one advantage to the entire completion.
Literature evidence:
- BAE (2602.10231): For structured generations (like <think>...</think> followed by JSON), outcome-level advantage assigns the same gradient signal to thinking tokens and answer tokens. But the thinking block's quality might differ from the answer block's quality. BAE assigns separate advantages to each block using outcome-conditioned baselines.
Implementation: Split completion into blocks at </think>. Score thinking block separately (was it concise? was it relevant?). Score answer block separately (was it correct?). Different advantages for different blocks.
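A minimal sketch of the split step; the block-specific scorers and advantage assignment would sit on top of it:

```python
def split_completion(completion: str) -> tuple[str, str]:
    """Split a completion into (thinking_block, answer_block) at the closing </think> tag."""
    think, sep, answer = completion.partition("</think>")
    if not sep:                       # no closing tag: the whole completion is "thinking", no answer block
        return completion, ""
    return think, answer.strip()
```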
7.6 🟡 EDGE-GRPO: Entropy-Driven Advantage (Directly Addresses Advantage Collapse)
What: Scale advantages by inverse normalized entropy, so responses that are both correct AND confident get higher advantages.
Why it's unexplored: Requires computing per-response entropy during training.
Literature evidence:
- EDGE-GRPO (2507.21848): When the model generates near-identical responses, the advantages are near-zero (advantage collapse). EDA divides advantages by normalized entropy: Â_i = A_i / P̂_i, where P̂_i is the normalized per-response entropy. This amplifies advantages for diverse, confident-correct responses and penalizes confident-incorrect ones.
- Also uses Guided Error Correction (GEC): For incorrect responses, inject the correct answer 25% of the time, ensuring each group contains positive examples. This is especially useful for hard tasks like extraction where the model might get 0/8 correct.
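Read literally, that formula scales each response's advantage by the inverse of its group-normalized entropy. A sketch of that reading (the paper's exact normalization may differ):

```python
import torch

def entropy_scaled_advantages(advantages: torch.Tensor, response_entropies: torch.Tensor) -> torch.Tensor:
    """Â_i = A_i / P̂_i, with P̂_i the mean per-token entropy of response i normalized within the group."""
    normalized = response_entropies / (response_entropies.sum() + 1e-8)
    return advantages / (normalized + 1e-8)
```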
7.7 🟢 Curriculum Learning with Progressive Context Length (Skywork-OR1)
What: Start training with shorter max_completion_length and progressively increase it across stages.
Why it's unexplored: v2 used fixed 2048, v3 uses fixed 4096.
Literature evidence:
- Skywork-OR1 (2505.22312) §3.2.2: Multi-stage training (progressive context length) "significantly reduces computational costs while preserving scalability." Start with 2048 → 4096 → 8192.
- Train Long Think Short (2508.08940): Curriculum GRPO that progressively tightens token budgets improves accuracy AND token efficiency.
7.8 🟢 Prompt Augmentation to Scale Data
What: Generate paraphrased/augmented versions of existing prompts to increase effective dataset size.
Literature evidence:
- Prompt Augmentation for GRPO (2602.03190): Augmenting training prompts (rephrasing, adding context variations) enables longer training without entropy collapse. "Prompt augmentation scales up GRPO training."
- Cocktail Effect (2410.01109): Mixing 30% general reasoning data with domain data improves domain performance by 2-15%.
7.9 🟢 DPO as Complement or Alternative
What: Use the SFT model to generate completions, score them with reward functions, and create preference pairs for DPO training.
Why it's unexplored: Project committed to GRPO from the start.
Literature evidence:
- Iterative DPO (2503.12854): DPO is computationally efficient and can achieve comparable results to RL for some tasks. Iteratively generating new preference pairs from the current policy and training DPO is effective.
- Tucano2 paper (2603.03543) §9: The Tucano2 Think model itself was post-trained using APO (a DPO variant) on GigaVerbo-v2 Preferences. This means the base Think model already has DPO in its training history; adding GRPO on top creates a complex interaction.
When to use: DPO might be more appropriate for tasks with clear good/bad pairs (extraction: valid JSON vs. invalid JSON) where you don't need the exploration that GRPO provides.
7.10 🟢 Separate Task-Specific LoRA Adapters
What: Train separate LoRA adapters for each task instead of one multi-task adapter.
Why it's unexplored: Current approach uses one adapter for all 4 tasks.
Rationale: Extraction and insights have fundamentally different optimal behaviors (terse JSON vs. verbose analysis). A single adapter must compromise. Separate adapters + routing would let each task optimize independently.
7.11 🟢 vLLM for Generation Speedup
What: Enable USE_VLLM=True in the training config.
Why it's unexplored: Available in the codebase (USE_VLLM = False in Cell 3) but disabled.
Expected impact: 10-20× generation speedup; total training time could drop from 25h to ~5-8h.
8. Literature-Backed Recommendations
Priority Matrix
| # | Action | Expected Impact | Effort | Risk | Paper Evidence |
|---|---|---|---|---|---|
| 1 | Switch to Base model | 🔴 Transformative (extraction 0.12 → 0.50+) | Medium (re-SFT required) | Low | ThinkJSON, DeepSeek-R1-Zero, Reasoning-SQL |
| 2 | Implement DAPO Clip-Higher | 🔴 High (fixes clip_ratio=0) | Medium (trainer subclass) | Low | DAPO §3.1, Tricks or Traps §4.2 |
| 3 | Add entropy bonus to loss | 🔴 High (prevents entropy collapse) | Medium (trainer subclass) | Low | Skywork-OR1 MAGIC (Eq. 3.1) |
| 4 | GDPO reward normalization | 🟡 Moderate (fixes task interference) | Low (reward fn change) | Low | GDPO §3.1, MO-GRPO Theorem 1 |
| 5 | Build formal benchmark | 🟡 Moderate (enables measurement) | Low (1-2 days) | None | - |
| 6 | Scale to 5000+ prompts | 🟡 Moderate | Medium (data generation) | Low | Skywork-OR1, Cocktail Effect |
| 7 | Dynamic task weighting | 🟡 Moderate (helps extraction) | Medium | Low | MT-GRPO §3 |
| 8 | Enable vLLM | 🟢 Low (speed only) | Low | Low | - |
| 9 | Curriculum context length | 🟢 Low-Moderate | Low | Low | Skywork-OR1 §3.2.2 |
| 10 | Blockwise advantages | 🟢 Low-Moderate | High | Medium | BAE (2602.10231) |
Recommended Execution Order
IMMEDIATE (while v3 runs):
→ Build benchmark (Phase 1 of ADR-001)
→ Prepare Base model SFT data

AFTER v3 COMPLETES:
→ Evaluate v3 vs v2 on benchmark
→ If extraction still < 0.3: Switch to Base model
→ Re-run SFT on Base model
→ GRPO v4 on Base with: DAPO clip, entropy bonus, GDPO rewards,
  dynamic task weighting, 5000+ prompts

IF v4 STILL SHOWS ENTROPY COLLAPSE:
→ Try EDGE-GRPO (GEC + EDA)
→ Try DPO as fallback for extraction/SQL specifically

DEPLOYMENT:
→ Hybrid: Base model for extraction/SQL/push, Think model for insights
→ Or: Single Base model with all tasks (likely better overall)
9. Risk Assessment
Risks of Current v3 Run
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Entropy collapse persists (clip_ratio=0 after step 50) | High (70%) | Training produces marginal improvement | Add entropy bonus, DAPO clip in v4 |
| Think model still can't produce JSON at inference (temp=0.1) | Very High (90%) | Good training metrics but poor deployment | Switch to Base model |
| 25h training gets preempted on Spot VM | Medium (30%) | Lost progress | Checkpoint every 10 steps ✅ |
| reward_think_efficiency has no effect | High (60%) | Think overhead unchanged | L1 paper says RL reward needed; single run may not learn |
Risks of Recommended Changes
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Base model SFT loses Portuguese quality | Low (10%) | Need to re-CPT | Tucano2-Base already has Portuguese CPT |
| DAPO clip causes training instability | Low (15%) | NaN loss | Start with ε_high=0.22 (conservative) |
| Data augmentation introduces noise | Medium (30%) | Reward signal degraded | Validate synthetic data quality with reward function |
10. Conclusion & Priority Roadmap
What's Working
- The SFT → GRPO pipeline is correct and producing measurable improvements (+42%)
- The reward function engineering is solid (continuous, multi-component, calibrated)
- The infrastructure and methodology are research-grade
- Portuguese domain specialization thesis is validated
What's Not Working
- Entropy collapse prevents real policy learning (clip_ratio=0)
- Thinking model is fundamentally incompatible with structured output under token constraints
- Data scale is 10-100× below published minimums
The Single Highest-Impact Change
Switch from Think model to Base model. This one change addresses two of three critical issues simultaneously:
- Eliminates the <think> overhead that blocks extraction/SQL output
- Reduces completion lengths → faster training → more steps per hour
- Aligns with every canonical GRPO paper's methodology
Combined with DAPO's Clip-Higher and Skywork-OR1's entropy bonus, this should break through the v2 performance plateau.
90-Day Roadmap
| Week | Action | Success Metric |
|---|---|---|
| 1 | Build benchmark, evaluate v3 | Benchmark ready, v3 numbers on 80 prompts |
| 2 | SFT on Base model, GRPO v4 probe | Base SFT loss < Think SFT loss; v4 probe clip_ratio > 0 |
| 3-4 | GRPO v4 full run (Base + DAPO clip + entropy bonus + GDPO) | eval reward > 0.25; extraction > 0.40 |
| 5-6 | Scale data to 5000+, GRPO v5 | eval reward > 0.35; all tasks > 0.30 |
| 7-8 | Benchmark vs Qwen3-35B-A3B and GPT-4o | Domain parity or better on structured tasks |
| 9-12 | Production deployment, monitoring, iteration | <100ms latency, <$0.002/query, >90% uptime |
Appendix A: Full Paper Reference Table
| Paper | ArXiv ID | Key Finding Used | Applied? |
|---|---|---|---|
| DeepSeek-R1 | 2501.12948 | SFT → GRPO pipeline, rule-based rewards | ✅ Yes |
| Dr. GRPO | 2503.20783 | Remove std normalization, remove length bias, β=0 | ⚠️ Partially (std removed, β=0 in v3) |
| Skywork-OR1 MAGIC | 2505.22312 | τ=1.0, entropy bonus, filter zero-advantage groups, multi-stage | ⚠️ Partially (τ=1.0 in v3, entropy monitor, noise injection) |
| MC-GRPO | 2601.22582 | Median baseline for G=4 | ❌ Not implemented |
| ThinkJSON | 2502.14905 | 1.5B Base + GRPO beats 671B on JSON extraction | ❌ Insight not acted on (still using Think) |
| Reasoning-SQL | 2503.23157 | 7B Base + GRPO beats o3-mini, staged rewards | ⚠️ Partially (staged rewards in v3) |
| Cocktail Effect | 2410.01109 | 30% general data improves domain performance 2-15% | ❌ Not implemented |
| DAPO | 2503.14476 | Decoupled clip (Clip-Higher), dynamic sampling, overlong filtering | ❌ Not implemented |
| GDPO | 2601.05242 | Decoupled reward normalization preserves fine-grained advantages | ❌ Not implemented |
| MO-GRPO | 2509.22047 | Variance-based reward dominance in multi-reward GRPO | ❌ Not implemented |
| MT-GRPO | 2602.05547 | Dynamic task weighting for balanced multi-task GRPO | ❌ Not implemented |
| EDGE-GRPO | 2507.21848 | Entropy-driven advantage + guided error correction | ❌ Not implemented |
| BAE | 2602.10231 | Blockwise advantages for structured output | ❌ Not implemented |
| Tricks or Traps | 2508.08221 | Local mean + global std for normalization; Clip-Higher verified | ❌ Not implemented |
| RL-Struct | 2512.00319 | Multi-dimensional reward for JSON (structure, format, validity, correctness, length) | ✅ Similar approach used |
| Prompt Augmentation | 2602.03190 | Prompt augmentation overcomes entropy collapse | ❌ Not implemented |
| Train Long Think Short | 2508.08940 | Progressive token budgets via curriculum | ❌ Not implemented |
| OptimalThinkingBench | 2508.13141 | "Don't overthink" prompts | ✅ Applied in v3 |
| L1 | 2503.04697 | Token budgets require RL training to work | ✅ Applied in v3 (reward_think_efficiency) |
| Tucano2 | 2603.03543 | Portuguese CPT on Qwen3, GigaVerbo-v2 datasets | ✅ Base model used |
Appendix B: Repository Structure Summary
rtferraz/tucano2-commerce/
├── docs/
│   ├── PROJECT.md                      # Comprehensive project documentation
│   ├── ADR-001-next-steps.md           # Detailed execution plans (benchmark, comparison, v3)
│   ├── v3_thinking_control_patch.md    # Task-aware thinking control spec
│   ├── INVESTIGATION_REPORT.md         # ← THIS FILE
│   └── checkpoints/
│       └── 2026-04-23_v3-launch.md     # v3 launch checkpoint with probe results
├── notebooks/
│   └── grpo_vertex_v3.ipynb            # v3 training notebook (running on Vertex AI)
├── scripts/
│   └── md_to_ipynb.py                  # Markdown → notebook converter
├── grpo_vertex_v2_ipynb.md             # v2 reference notebook with all outputs
└── .gitignore

rtferraz/commerce-model-qwen3.5-lora/
├── adapter_config.json                 # LoRA r=16, α=32, Qwen3.5-9B
├── adapter_model.safetensors           # 111MB adapter weights
├── chat_template.jinja
├── processor_config.json
├── tokenizer.json
└── tokenizer_config.json

rtferraz/parameter-golf-v2/
├── ANALYSIS.md                         # Competition gap analysis
├── train_final.py                      # Full training script (SP8192+PAF+TTT+Int6)
└── train_gpt2.py                       # Earlier GPT-2 based attempt
Report generated on 2026-04-25 by automated investigation of all project artifacts, cross-referenced with 20+ published papers.