
Tucano2-Commerce: Domain-Specialized LLM for Brazilian E-Commerce Analysis

Project Status: v2 Complete; v3 Planned

Model: Qwen3-3.7B → SFT → GRPO alignment
Domain: Brazilian e-commerce (sentiment analysis, churn prediction, SQL generation, structured extraction)
Infrastructure: Vertex AI Workbench, NVIDIA L4 (24GB), Unsloth + TRL 0.24.0
Tracking: W&B project tferrazrafael-self/tucano2-commerce


1. Problem Statement

Brazilian e-commerce companies need automated analysis of customer reviews, churn prediction, and business intelligence generation, all in Portuguese. General-purpose LLMs (GPT-4o, Claude) are:

  1. Expensive at scale: API costs of ~$0.01/analysis × thousands of daily reviews
  2. Not domain-optimized: they miss Brazilian Portuguese idioms and e-commerce-specific patterns
  3. Not self-hosted: customer data leaves the organization for every API call

Goal: Build a compact (3.7B parameter) model that matches or exceeds large general models on e-commerce-specific tasks, runs on a single GPU, and keeps data on-premise.


2. Context & Approach

Architecture Decision: SFT + GRPO

The training pipeline follows the DeepSeek-R1 paradigm (arXiv:2501.12948):

Qwen3-3.7B (base) → SFT (domain adaptation) → GRPO (alignment via reward signals)

Why Qwen3-3.7B:

  • Strong multilingual base with Portuguese capability
  • 3.7B parameters fits in 24GB VRAM with 4-bit quantization (Unsloth NF4)
  • Qwen3 architecture includes native <think> reasoning mode

Why GRPO over DPO/PPO:

  • No need for a separate reward model (rule-based rewards suffice for structured tasks)
  • Group-relative optimization naturally handles multi-task reward distributions
  • Published results show GRPO working well at this model scale (Dr. GRPO, Skywork-OR1)

Why rule-based rewards:

  • E-commerce tasks have verifiable outputs (JSON schema adherence, SQL execution, sentiment polarity)
  • Neural reward models introduce reward hacking at small scale
  • DeepSeek-R1 demonstrated rule-based rewards outperform neural reward models for structured tasks

Task Portfolio

| Task | Input | Output | Evaluation |
|---|---|---|---|
| Structured Extraction | Customer review + metadata | JSON with 10 fields | Field-level match |
| Sentiment Analysis | Review text | Polarity + score | Accuracy + F1 |
| SQL Generation | Business question | Executable SQL query | Execution accuracy |
| Churn Prediction | Customer profile | Risk score + reasoning | Binary accuracy |
| Business Insights | Open-ended question | Analytical report in PT-BR | LLM-as-judge |

Infrastructure

  • Training: Vertex AI Workbench, single NVIDIA L4 (24GB VRAM)
  • Quantization: Unsloth NF4 (4-bit) for training β€” enables 3.7B model to fit in 24GB
  • Framework: TRL 0.24.0 (pinned for Unsloth compatibility), UnslothGRPOTrainer
  • Monitoring: Weights & Biases

3. Decision Log

Decision 1: Continuous vs. Binary Reward Functions

  • Context: Initial reward functions used binary (0/1) scoring
  • Problem: 50% of training steps showed reward_std=0 and loss=0 β€” no learning signal
  • Decision: Rewrote all 4 reward functions with continuous scoring (0.0–1.0), partial credit for partially correct outputs
  • Consequence: Zero-std steps dropped from 50% to ~10%; loss became consistently non-zero
  • Reference: Dr. GRPO paper (2503.20783) proves std-based normalization amplifies this issue

Decision 2: Temperature 0.8 → 1.0

  • Context: Model's generation_config.json had temperature=0.1 (default from Qwen3)
  • Problem: All 8 GRPO completions were near-identical β†’ zero reward variance β†’ zero advantage β†’ zero gradient. frac_reward_zero_std=1.0 on every step. First full run was killed.
  • Decision v2: Set temperature=0.8 in GRPOConfig
  • Outcome v2: Fixed the zero-std catastrophe. Training ran 210 steps, eval improved 50% (0.083β†’0.125)
  • Decision v3 (planned): Increase to temperature=1.0 β€” all published GRPO papers (DeepSeek-R1, Dr. GRPO, Skywork-OR1) use 1.0. Higher temperature further delays entropy collapse.
  • Reference: Skywork-OR1 (2505.22312) ablation: Ο„=1.0 gives 5-8% better test performance than Ο„=0.6

Decision 3: scale_rewards=False (Dr. GRPO)

  • Context: Standard GRPO normalizes advantages by std(rewards) per group
  • Problem: When one group has low variance, dividing by a small std inflates its gradient contribution β†’ training instability and bias toward "easy" prompts
  • Decision: Disabled std normalization following Dr. GRPO paper
  • Consequence: More stable training; combined with continuous rewards, eliminated most zero-gradient steps
  • Reference: Dr. GRPO (2503.20783) achieved SOTA 43.3% on AIME 2024 with a 7B model using this fix

Decision 4: Early Stopping Configuration

  • Context (v2 run 1): EARLY_STOPPING_PATIENCE=3, EVAL_STEPS=10, EVAL_MAX_TOKENS=256
  • Problem: Model killed at step 40. Eval tokens too short β€” model needs 500-700 tokens just for </think>. Eval was scoring incomplete generations. Only 30 steps of runway before early stopping fired.
  • Decision (v2 run 2): PATIENCE=10, EVAL_MAX_TOKENS=2048, EVAL_MAX_SAMPLES=5
  • Consequence: Training ran to step 210, early stopping fired correctly when eval plateaued
  • Lesson: Early stopping parameters must account for the model's generation length requirements

Decision 5: MAX_STEPS vs. NUM_EPOCHS

  • Context: User set NUM_EPOCHS=2, MAX_STEPS=300
  • Clarification: In TRL, MAX_STEPS takes absolute priority. With 300 prompts Γ— 8 generations / (batch_size=4 Γ— grad_accum=2) = 300 steps per epoch. MAX_STEPS=300 = exactly one epoch regardless of NUM_EPOCHS.
  • Decision: Keep MAX_STEPS=300 for one clean epoch; decide on epoch 2 based on eval trajectory
  • Consequence: Early stopping at 210 means the model trained through 70% of the data before plateauing

Decision 6: TRL 0.24.0 Pinning

  • Context: Unsloth requires specific TRL versions. Upgrading TRL broke vllm/torch dependencies.
  • Decision: Pin trl==0.24.0 --no-deps after Unsloth installation
  • Consequence: Stable environment, but locks out newer TRL features (e.g., entropy bonus in GRPOConfig)
  • Workaround for v3: Implement entropy control via custom callback or trainer subclass

4. Training Results

v2 Final Run (Step 210/300, Early Stopped)

| Metric | Value | Assessment |
|---|---|---|
| eval/best_reward_final | 0.125 | +50% from starting 0.083 |
| train/reward | 0.285 | Below SFT calibration (0.38) |
| train/frac_reward_zero_std | 0.0 | ✅ Fixed (was 1.0 in v1) |
| train/clip_ratio | 0.0 (all) | ⚠️ Entropy collapse: policy never hit clip bounds |
| train/kl | 0.004 | Very conservative policy shift |
| train/completion_length | 2048 (= max) | ⚠️ Every completion truncated |
| train/grad_norm | 0.030 | Stable |
| Duration | 14.9 hours | 210 steps × ~4.3 min/step |

Validation Results (5 held-out prompts)

| Sample | Task | Reward | Notes |
|---|---|---|---|
| 1 | Extraction (JSON) | 0.12 | Fields incorrect, output truncated |
| 2 | Insights (categories) | 0.70 | Coherent PT-BR, structured headers |
| 3 | Retention analysis | 0.70 | Step-by-step methodology |
| 4 | Reengagement decision | 0.50 | `<think>` reasoning visible, contextual |
| 5 | Regional comparison | 0.70 | Comparative framework |

Mean validation reward: 0.54 vs SFT calibration baseline of 0.38 → +42% improvement

Bimodal Performance Pattern

  • Strong (0.50–0.70): Open-ended analysis, insights, comparison tasks
  • Weak (0.12): Structured JSON extraction β€” the completion ceiling blocks the model from outputting complete JSON

5. Diagnosed Issues & Root Causes

Issue 1: Entropy Collapse (Critical)

  • Symptom: clip_ratio=0 on all steps, KL=0.004
  • Root cause: Policy entropy drops to near-zero β†’ all 8 rollouts produce identical output β†’ zero advantage β†’ zero gradient (Skywork-OR1, 2505.22312)
  • Fix (v3): Temperature=1.0, add entropy loss coefficient (Ξ±=5e-3), filter zero-advantage groups

Issue 2: Completion Length Ceiling (Critical)

  • Symptom: completion_length=2048 (= max_completion_length) on every step
  • Root cause: GRPO length bias inflates incorrect response length (Dr. GRPO, 2503.20783 Β§3.1). Model can't finish reasoning β†’ gets low rewards β†’ weak signal
  • Fix (v3): Increase max_completion_length to 4096, reduce num_generations 8β†’4 to fit VRAM

Issue 3: Data Scale (Moderate)

  • Symptom: Early stopping at step 210 (70% of epoch), eval plateaued at 0.125
  • Root cause: 300 prompts is below published minimums (1K–600K in literature)
  • Fix (v3): Expand to 1000+ prompts via synthetic generation and data augmentation

6. Lessons Learned

Technical Lessons

  1. Default model configs kill RL training. Qwen3's generation_config.json sets temperature=0.1. This single default destroyed the first full training run. Always override generation parameters explicitly.

  2. Reward function design is the core ML engineering task. Binary rewards → zero signal. Continuous rewards → training works. Multi-component rewards (format + content + quality) → staged learning where format converges first. The reward function IS the product specification.

  3. GRPO needs diversity to learn. The algorithm is fundamentally about comparing different completions. Anything that reduces diversity (low temperature, small group size, few prompts, completion ceiling) directly reduces learning signal.

  4. TRL step calculation is non-obvious. steps = num_prompts × num_generations / (batch_size × grad_accum). Missing the num_generations multiplier gives wrong epoch estimates. MAX_STEPS always overrides NUM_EPOCHS.

  5. Early stopping needs tuning for generative models. Patience must account for eval generation length. Short eval tokens → incomplete outputs → flat eval scores → premature stopping.

  6. Entropy collapse is the GRPO failure mode. Not divergence, not reward hacking: the model collapses to deterministic output. Monitoring clip_ratio and generation entropy is essential (a minimal entropy probe is sketched after this list).
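
A minimal entropy probe for that monitoring, assuming plain PyTorch (a sketch, not the project's actual logging code):

```python
import torch

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Average per-token entropy (in nats) over a batch of generation logits.

    logits: [batch, seq_len, vocab]. A value drifting toward zero is an early
    warning that rollouts are becoming deterministic (entropy collapse).
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # [batch, seq_len]
    return entropy.mean().item()
```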

Business Lessons

  1. Domain data is the moat, not model size. ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. A 7B model beats o3-mini on SQL. 300 Portuguese e-commerce examples already produced a model that outperforms SFT baseline by 42%.

  2. Cost arbitrage is immediate. A self-hosted 3.7B model on a $0.50/hr GPU costs ~$0.001/analysis vs $0.01+ for API calls. Breakeven at ~100 analyses/day.

  3. Privacy is a feature. Self-hosted means customer data never leaves the organization. This matters for LGPD (Brazilian data protection law) compliance.

  4. Portuguese-first is defensible. Most LLM development is English-first. A model that deeply understands Brazilian e-commerce Portuguese ("veio com defeito" / "it came defective", "nota 1 estrela" / "1-star rating") has a real competitive advantage.

  5. Budget 3-5 iterations, not 1. The first run is diagnostic. v1 found the zero-signal bug. v2 found the temperature bug and completion ceiling. v3 will address entropy collapse. Each iteration is cheaper than the last because you know what to measure.


7. Next Steps

See docs/ADR-001-next-steps.md for detailed execution plans.

Priority 1: Build Domain Benchmark (1-2 days)

50-100 held-out prompts, automated scoring, establish baselines

Priority 2: Run Comparison vs. Qwen3-35B-A3B (1 day)

Prove small tuned model matches/beats large general model on domain tasks

Priority 3: GRPO v3 Training Run (2-3 days)

Fix entropy collapse, increase completion length, expand training data


References

| Paper | Key Finding | Relevance |
|---|---|---|
| DeepSeek-R1 (2501.12948) | SFT → GRPO pipeline, rule-based rewards | Architecture template |
| Dr. GRPO (2503.20783) | Remove std normalization, remove length bias | Fixes reward scaling |
| Skywork-OR1 MAGIC (2505.22312) | Entropy collapse diagnosis and fix | Explains clip_ratio=0 |
| MC-GRPO (2601.22582) | Median baseline for small rollout budgets | Fixes G=8 noise |
| ThinkJSON (2502.14905) | 1.5B beats 671B on JSON extraction | Proves domain specialization thesis |
| Reasoning-SQL (2503.23157) | 7B beats o3-mini on SQL with GRPO | Proves GRPO works for SQL |
| Cocktail Effect (2410.01109) | Multi-task SFT + general data boosts domain performance | SFT improvement recipe |