Tucano2-Commerce: Domain-Specialized LLM for Brazilian E-Commerce Analysis
Project Status: v2 Complete → v3 Planned
Model: Qwen3-3.7B → SFT → GRPO alignment
Domain: Brazilian e-commerce (sentiment analysis, churn prediction, SQL generation, structured extraction)
Infrastructure: Vertex AI Workbench, NVIDIA L4 (24GB), Unsloth + TRL 0.24.0
Tracking: W&B project tferrazrafael-self/tucano2-commerce
1. Problem Statement
Brazilian e-commerce companies need automated analysis of customer reviews, churn prediction, and business intelligence generation, all in Portuguese. General-purpose LLMs (GPT-4o, Claude) are:
- Expensive at scale: API costs of ~$0.01/analysis × thousands of daily reviews
- Not domain-optimized: they miss Brazilian Portuguese idioms and e-commerce-specific patterns
- Not self-hosted: customer data leaves the organization on every API call
Goal: Build a compact (3.7B parameter) model that matches or exceeds large general models on e-commerce-specific tasks, runs on a single GPU, and keeps data on-premise.
2. Context & Approach
Architecture Decision: SFT + GRPO
The training pipeline follows the DeepSeek-R1 paradigm (arxiv: 2501.12948):
Qwen3-3.7B (base) → SFT (domain adaptation) → GRPO (alignment via reward signals)
Why Qwen3-3.7B:
- Strong multilingual base with Portuguese capability
- 3.7B parameters fit in 24GB VRAM with 4-bit quantization (Unsloth NF4)
- The Qwen3 architecture includes a native <think> reasoning mode
Why GRPO over DPO/PPO:
- No need for a separate reward model (rule-based rewards suffice for structured tasks)
- Group-relative optimization naturally handles multi-task reward distributions
- Published results show GRPO working well at this model scale (Dr. GRPO, Skywork-OR1)
Why rule-based rewards:
- E-commerce tasks have verifiable outputs (JSON schema adherence, SQL execution, sentiment polarity)
- Neural reward models introduce reward hacking at small scale
- DeepSeek-R1 demonstrated rule-based rewards outperform neural reward models for structured tasks
Task Portfolio
| Task | Input | Output | Evaluation |
|---|---|---|---|
| Structured Extraction | Customer review + metadata | JSON with 10 fields | Field-level match |
| Sentiment Analysis | Review text | Polarity + score | Accuracy + F1 |
| SQL Generation | Business question | Executable SQL query | Execution accuracy |
| Churn Prediction | Customer profile | Risk score + reasoning | Binary accuracy |
| Business Insights | Open-ended question | Analytical report in PT-BR | LLM-as-judge |
Infrastructure
- Training: Vertex AI Workbench, single NVIDIA L4 (24GB VRAM)
- Quantization: Unsloth NF4 (4-bit) for training, which lets the 3.7B model fit in 24GB
- Framework: TRL 0.24.0 (pinned for Unsloth compatibility), UnslothGRPOTrainer
- Monitoring: Weights & Biases
3. Decision Log
Decision 1: Continuous vs. Binary Reward Functions
- Context: Initial reward functions used binary (0/1) scoring
- Problem: 50% of training steps showed reward_std=0 and loss=0, i.e., no learning signal
- Decision: Rewrote all 4 reward functions with continuous scoring (0.0–1.0), giving partial credit for partially correct outputs
- Consequence: Zero-std steps dropped from 50% to ~10%; loss became consistently non-zero
- Reference: the Dr. GRPO paper (2503.20783) shows that std-based normalization amplifies this issue
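As an illustration of the continuous-scoring approach, here is a minimal sketch of a field-level reward for the JSON extraction task (the function name, the 0.2/0.8 split, and the field names are hypothetical, not the project's actual reward code):

```python
import json

def extraction_reward(completion: str, expected: dict) -> float:
    """Continuous reward in [0, 1]: fraction of expected fields
    reproduced correctly, with partial credit for valid-but-wrong JSON."""
    try:
        got = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0                      # unparseable output earns nothing
    if not isinstance(got, dict):
        return 0.0
    # 0.2 for emitting valid JSON, 0.8 spread across field-level matches
    matched = sum(1 for k, v in expected.items() if got.get(k) == v)
    return 0.2 + 0.8 * matched / len(expected)

# A binary reward would return 1.0 only on a perfect match, giving
# zero variance (and zero gradient) whenever all group completions fail.
```

Because partially correct outputs earn intermediate scores, rewards within a rollout group differ even when no completion is perfect, which is exactly what restores the learning signal.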
Decision 2: Temperature 0.8 → 1.0
- Context: The model's generation_config.json had temperature=0.1 (the Qwen3 default)
- Problem: All 8 GRPO completions were near-identical → zero reward variance → zero advantage → zero gradient. frac_reward_zero_std=1.0 on every step; the first full run was killed.
- Decision v2: Set temperature=0.8 in GRPOConfig
- Outcome v2: Fixed the zero-std catastrophe. Training ran 210 steps; eval improved 50% (0.083→0.125)
- Decision v3 (planned): Increase to temperature=1.0. The GRPO papers referenced here (DeepSeek-R1, Dr. GRPO, Skywork-OR1) all use 1.0, and the higher temperature further delays entropy collapse.
- Reference: Skywork-OR1 (2505.22312) ablation: τ=1.0 gives 5-8% better test performance than τ=0.6
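The mechanism behind the zero-variance failure is visible in the softmax itself; a small sketch with illustrative logits:

```python
import math

def softmax_with_temperature(logits, tau):
    """Convert logits to next-token sampling probabilities at temperature tau."""
    scaled = [x / tau for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]                 # illustrative next-token logits
cold = softmax_with_temperature(logits, 0.1)
warm = softmax_with_temperature(logits, 1.0)
# At tau=0.1 nearly all probability mass sits on the top token, so all
# 8 rollouts agree; at tau=1.0 the alternatives keep meaningful mass.
```

At τ=0.1 the top token here gets >99% of the mass, which is why every rollout in a group came out near-identical; at τ=1.0 the same logits leave roughly a third of the mass on the alternatives.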
Decision 3: scale_rewards=False (Dr. GRPO)
- Context: Standard GRPO normalizes advantages by std(rewards) per group
- Problem: When one group has low reward variance, dividing by its small std inflates that group's gradient contribution, causing training instability and a bias toward "easy" prompts
- Decision: Disabled std normalization following Dr. GRPO paper
- Consequence: More stable training; combined with continuous rewards, eliminated most zero-gradient steps
- Reference: Dr. GRPO (2503.20783) achieved SOTA 43.3% on AIME 2024 with a 7B model using this fix
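A toy group comparison (illustrative reward values, not the project's data) shows why dividing by a small per-group std is harmful:

```python
import statistics

def advantages(rewards, scale=True):
    """Group-relative advantages: subtract the group mean, and
    optionally divide by the group std (standard GRPO behavior)."""
    mu = statistics.mean(rewards)
    centered = [r - mu for r in rewards]
    if not scale:
        return centered                        # Dr. GRPO: mean-centering only
    sd = statistics.pstdev(rewards) + 1e-4     # epsilon guards divide-by-zero
    return [c / sd for c in centered]

hard = [0.1, 0.9, 0.5, 0.3]       # diverse rewards on a hard prompt
easy = [0.70, 0.72, 0.71, 0.69]   # near-identical rewards on an easy prompt
# With scaling, the easy group's tiny std blows its advantages up to the
# same magnitude as the hard group's, despite carrying far less signal.
```

With `scale=True` both groups end up with advantages of comparable magnitude; with `scale=False` (the Dr. GRPO setting adopted here) the near-tied easy group contributes almost nothing, as it should.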
Decision 4: Early Stopping Configuration
- Context (v2 run 1): EARLY_STOPPING_PATIENCE=3, EVAL_STEPS=10, EVAL_MAX_TOKENS=256
- Problem: Run killed at step 40. The eval token budget was too short: the model needs 500-700 tokens just to close </think>, so eval was scoring incomplete generations, and there were only 30 steps of runway before early stopping fired.
- Decision (v2 run 2): PATIENCE=10, EVAL_MAX_TOKENS=2048, EVAL_MAX_SAMPLES=5
- Consequence: Training ran to step 210, and early stopping fired correctly when eval plateaued
- Lesson: Early stopping parameters must account for the model's generation length requirements
Decision 5: MAX_STEPS vs. NUM_EPOCHS
- Context: User set NUM_EPOCHS=2, MAX_STEPS=300
- Clarification: In TRL, MAX_STEPS takes absolute priority. With 300 prompts × 8 generations / (batch_size=4 × grad_accum=2) = 300 steps per epoch, MAX_STEPS=300 is exactly one epoch regardless of NUM_EPOCHS.
- Decision: Keep MAX_STEPS=300 for one clean epoch; decide on a second epoch based on the eval trajectory
- Consequence: Early stopping at step 210 means the model trained through 70% of the data before plateauing
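The step arithmetic above can be checked directly:

```python
# Sanity check of the TRL step arithmetic described above.
num_prompts = 300
num_generations = 8       # completions sampled per prompt
batch_size = 4
grad_accum = 2

samples_per_epoch = num_prompts * num_generations        # 2400 completions
samples_per_step = batch_size * grad_accum               # 8 completions/optimizer step
steps_per_epoch = samples_per_epoch // samples_per_step  # 300 steps
```

Forgetting the `num_generations` multiplier would suggest 300 / 8 ≈ 38 steps per epoch, which is why the epoch estimate is easy to get wrong.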
Decision 6: TRL 0.24.0 Pinning
- Context: Unsloth requires specific TRL versions. Upgrading TRL broke vllm/torch dependencies.
- Decision: Pin trl==0.24.0 (installed with --no-deps) after the Unsloth installation
- Consequence: Stable environment, but it locks out newer TRL features (e.g., the entropy bonus in GRPOConfig)
- Workaround for v3: Implement entropy control via custom callback or trainer subclass
4. Training Results
v2 Final Run (Step 210/300, Early Stopped)
| Metric | Value | Assessment |
|---|---|---|
| eval/best_reward_final | 0.125 | +50% from starting 0.083 |
| train/reward | 0.285 | Below SFT calibration (0.38) |
| train/frac_reward_zero_std | 0.0 | ✓ Fixed (was 1.0 in v1) |
| train/kl | 0.004 | Very conservative policy shift |
| train/clip_ratio | 0.0 (all) | ⚠️ Entropy collapse: policy never hit clip bounds |
| train/completion_length | 2048 (= max) | ⚠️ Every completion truncated |
| train/grad_norm | 0.030 | Stable |
| Duration | 14.9 hours | 210 steps × ~4.3 min/step |
Validation Results (5 held-out prompts)
| Sample | Task | Reward | Notes |
|---|---|---|---|
| 1 | Extraction (JSON) | 0.12 | Fields incorrect, output truncated |
| 2 | Insights (categories) | 0.70 | Coherent PT-BR, structured headers |
| 3 | Retention analysis | 0.70 | Step-by-step methodology |
| 4 | Reengagement decision | 0.50 | <think> reasoning visible, contextual |
| 5 | Regional comparison | 0.70 | Comparative framework |
Mean validation reward: 0.54 vs the SFT calibration baseline of 0.38, a +42% improvement
Bimodal Performance Pattern
- Strong (0.50–0.70): Open-ended analysis, insights, comparison tasks
- Weak (0.12): Structured JSON extraction, where the completion ceiling blocks the model from emitting complete JSON
5. Diagnosed Issues & Root Causes
Issue 1: Entropy Collapse (Critical)
- Symptom: clip_ratio=0 on all steps, KL=0.004
- Root cause: Policy entropy drops to near zero → all 8 rollouts produce identical output → zero advantage → zero gradient (Skywork-OR1, 2505.22312)
- Fix (v3): temperature=1.0, add an entropy loss coefficient (α=5e-3), filter zero-advantage groups
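The zero-advantage filter can be sketched in a few lines (a hypothetical standalone helper, not tied to any particular trainer API):

```python
def filter_zero_advantage_groups(groups, eps=1e-6):
    """Drop rollout groups whose rewards are all (numerically) identical:
    such groups produce zero advantage and therefore zero gradient."""
    kept = []
    for rewards in groups:
        if max(rewards) - min(rewards) > eps:   # any spread at all?
            kept.append(rewards)
    return kept

batch = [
    [0.5, 0.5, 0.5, 0.5],   # collapsed group: no learning signal
    [0.1, 0.9, 0.4, 0.6],   # informative group: kept
]
# Only the informative group survives filtering.
```

Discarding collapsed groups before the loss computation keeps them from diluting the batch, at the cost of a variable effective batch size.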
Issue 2: Completion Length Ceiling (Critical)
- Symptom: completion_length=2048 (= max_completion_length) on every step
- Root cause: GRPO's length bias inflates the length of incorrect responses (Dr. GRPO, 2503.20783 §3.1). The model can't finish its reasoning → low rewards → weak signal
- Fix (v3): Increase max_completion_length to 4096; reduce num_generations 8→4 to fit VRAM
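A back-of-envelope check (assuming generation memory scales roughly with total completion tokens per prompt, which is a simplification) shows the planned trade keeps the token budget constant:

```python
# v2 vs planned v3: num_generations × max_completion_length per prompt
v2_tokens = 8 * 2048   # 8 rollouts capped at 2048 tokens each
v3_tokens = 4 * 4096   # 4 rollouts capped at 4096 tokens each
# Same total generated tokens per prompt, so VRAM pressure stays similar
# while each individual rollout gets room to finish its reasoning.
```

The cost is a smaller group (G=4), which makes the group-relative advantage estimate noisier; the MC-GRPO reference in §References targets exactly this small-rollout regime.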
Issue 3: Data Scale (Moderate)
- Symptom: Early stopping at step 210 (70% of epoch), eval plateaued at 0.125
- Root cause: 300 prompts is below published minimums (1K–600K in the literature)
- Fix (v3): Expand to 1000+ prompts via synthetic generation and data augmentation
6. Lessons Learned
Technical Lessons
Default model configs kill RL training. Qwen3's generation_config.json sets temperature=0.1. This single default destroyed the first full training run. Always override generation parameters explicitly.

Reward function design is the core ML engineering task. Binary rewards → zero signal. Continuous rewards → training works. Multi-component rewards (format + content + quality) → staged learning in which format converges first. The reward function IS the product specification.

GRPO needs diversity to learn. The algorithm is fundamentally about comparing different completions. Anything that reduces diversity (low temperature, small group size, few prompts, the completion ceiling) directly reduces the learning signal.

TRL step calculation is non-obvious. steps = num_prompts × num_generations / (batch_size × grad_accum). Missing the num_generations multiplier gives wrong epoch estimates, and MAX_STEPS always overrides NUM_EPOCHS.

Early stopping needs tuning for generative models. Patience must account for eval generation length. Short eval tokens → incomplete outputs → flat eval scores → premature stopping.

Entropy collapse is the GRPO failure mode. Not divergence, not reward hacking: the model collapses to deterministic output. Monitoring clip_ratio and generation entropy is essential.
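Generation entropy can be monitored with a small helper (pure-Python sketch; in practice the per-token distributions would come from the model's logits):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average per-token entropy across a generation; a slide toward 0
    signals the policy is becoming deterministic (entropy collapse)."""
    return sum(token_entropy(p) for p in distributions) / len(distributions)

healthy = [[0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
collapsed = [[0.999, 0.001, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
# Healthy generations keep entropy well above zero; a collapsed policy
# concentrates all mass on one token and its entropy approaches zero.
```

Logging this alongside clip_ratio would have flagged the collapse well before step 210: clip_ratio=0 is a lagging symptom, while the entropy trend is visible from the first eval.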
Business Lessons
Domain data is the moat, not model size. ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. A 7B model beats o3-mini on SQL. 300 Portuguese e-commerce examples already produced a model that outperforms SFT baseline by 42%.
Cost arbitrage is immediate. A self-hosted 3.7B model on a $0.50/hr GPU costs ~$0.001/analysis vs $0.01+ for API calls. Breakeven at ~100 analyses/day.
Privacy is a feature. Self-hosted means customer data never leaves the organization. This matters for LGPD (Brazilian data protection law) compliance.
Portuguese-first is defensible. Most LLM development is English-first. A model that deeply understands Brazilian e-commerce Portuguese ("veio com defeito", "nota 1 estrela") has a real competitive advantage.
Budget 3-5 iterations, not 1. The first run is diagnostic. v1 found the zero-signal bug. v2 found the temperature bug and completion ceiling. v3 will address entropy collapse. Each iteration is cheaper than the last because you know what to measure.
7. Next Steps
See docs/ADR-001-next-steps.md for detailed execution plans.
Priority 1: Build Domain Benchmark (1-2 days)
50-100 held-out prompts, automated scoring, establish baselines
Priority 2: Run Comparison vs. Qwen3-35B-A3B (1 day)
Prove small tuned model matches/beats large general model on domain tasks
Priority 3: GRPO v3 Training Run (2-3 days)
Fix entropy collapse, increase completion length, expand training data
References
| Paper | Key Finding | Relevance |
|---|---|---|
| DeepSeek-R1 (2501.12948) | SFT→GRPO pipeline, rule-based rewards | Architecture template |
| Dr. GRPO (2503.20783) | Remove std normalization, remove length bias | Fixes reward scaling |
| Skywork-OR1 MAGIC (2505.22312) | Entropy collapse diagnosis and fix | Explains clip_ratio=0 |
| MC-GRPO (2601.22582) | Median baseline for small rollout budgets | Fixes G=8 noise |
| ThinkJSON (2502.14905) | 1.5B beats 671B on JSON extraction | Proves domain specialization thesis |
| Reasoning-SQL (2503.23157) | 7B beats o3-mini on SQL with GRPO | Proves GRPO works for SQL |
| Cocktail Effect (2410.01109) | Multi-task SFT + general data boosts domain performance | SFT improvement recipe |