
Tucano2-Commerce: Domain-Specialized LLM for Brazilian E-Commerce Analysis

Project Status: v2 Complete; v3 Planned

Model: Qwen3-3.7B → SFT → GRPO alignment
Domain: Brazilian e-commerce (sentiment analysis, churn prediction, SQL generation, structured extraction)
Infrastructure: Vertex AI Workbench, NVIDIA L4 (24GB), Unsloth + TRL 0.24.0
Tracking: W&B project tferrazrafael-self/tucano2-commerce


1. Problem Statement

Brazilian e-commerce companies need automated analysis of customer reviews, churn prediction, and business intelligence generation, all in Portuguese. General-purpose LLMs (GPT-4o, Claude) are:

  1. Expensive at scale: API costs of ~$0.01/analysis × thousands of daily reviews
  2. Not domain-optimized: they miss Brazilian Portuguese idioms and e-commerce-specific patterns
  3. Not self-hosted: customer data leaves the organization for every API call

Goal: Build a compact (3.7B parameter) model that matches or exceeds large general models on e-commerce-specific tasks, runs on a single GPU, and keeps data on-premise.


2. Context & Approach

Architecture Decision: SFT + GRPO

The training pipeline follows the DeepSeek-R1 paradigm (arXiv:2501.12948):

Qwen3-3.7B (base) → SFT (domain adaptation) → GRPO (alignment via reward signals)

Why Qwen3-3.7B:

  • Strong multilingual base with Portuguese capability
  • 3.7B parameters fits in 24GB VRAM with 4-bit quantization (Unsloth NF4)
  • Qwen3 architecture includes native <think> reasoning mode

Why GRPO over DPO/PPO:

  • No need for a separate reward model (rule-based rewards suffice for structured tasks)
  • Group-relative optimization naturally handles multi-task reward distributions
  • Published results show GRPO working well at this model scale (Dr. GRPO, Skywork-OR1)

Why rule-based rewards:

  • E-commerce tasks have verifiable outputs (JSON schema adherence, SQL execution, sentiment polarity)
  • Neural reward models introduce reward hacking at small scale
  • DeepSeek-R1 demonstrated rule-based rewards outperform neural reward models for structured tasks

Task Portfolio

| Task | Input | Output | Evaluation |
|---|---|---|---|
| Structured Extraction | Customer review + metadata | JSON with 10 fields | Field-level match |
| Sentiment Analysis | Review text | Polarity + score | Accuracy + F1 |
| SQL Generation | Business question | Executable SQL query | Execution accuracy |
| Churn Prediction | Customer profile | Risk score + reasoning | Binary accuracy |
| Business Insights | Open-ended question | Analytical report in PT-BR | LLM-as-judge |

Infrastructure

  • Training: Vertex AI Workbench, single NVIDIA L4 (24GB VRAM)
  • Quantization: Unsloth NF4 (4-bit) for training β€” enables 3.7B model to fit in 24GB
  • Framework: TRL 0.24.0 (pinned for Unsloth compatibility), UnslothGRPOTrainer
  • Monitoring: Weights & Biases

3. Decision Log

Decision 1: Continuous vs. Binary Reward Functions

  • Context: Initial reward functions used binary (0/1) scoring
  • Problem: 50% of training steps showed reward_std=0 and loss=0 β€” no learning signal
  • Decision: Rewrote all 4 reward functions with continuous scoring (0.0–1.0), partial credit for partially correct outputs
  • Consequence: Zero-std steps dropped from 50% to ~10%; loss became consistently non-zero
  • Reference: Dr. GRPO paper (2503.20783) proves std-based normalization amplifies this issue

Decision 2: Temperature 0.8 → 1.0

  • Context: Model's generation_config.json had temperature=0.1 (default from Qwen3)
  • Problem: All 8 GRPO completions were near-identical β†’ zero reward variance β†’ zero advantage β†’ zero gradient. frac_reward_zero_std=1.0 on every step. First full run was killed.
  • Decision v2: Set temperature=0.8 in GRPOConfig
  • Outcome v2: Fixed the zero-std catastrophe. Training ran 210 steps, eval improved 50% (0.083β†’0.125)
  • Decision v3 (planned): Increase to temperature=1.0 β€” all published GRPO papers (DeepSeek-R1, Dr. GRPO, Skywork-OR1) use 1.0. Higher temperature further delays entropy collapse.
  • Reference: Skywork-OR1 (2505.22312) ablation: Ο„=1.0 gives 5-8% better test performance than Ο„=0.6

Decision 3: scale_rewards=False (Dr. GRPO)

  • Context: Standard GRPO normalizes advantages by std(rewards) per group
  • Problem: When one group has low variance, dividing by a small std inflates its gradient contribution β†’ training instability and bias toward "easy" prompts
  • Decision: Disabled std normalization following Dr. GRPO paper
  • Consequence: More stable training; combined with continuous rewards, eliminated most zero-gradient steps
  • Reference: Dr. GRPO (2503.20783) achieved SOTA 43.3% on AIME 2024 with a 7B model using this fix

Decision 4: Early Stopping Configuration

  • Context (v2 run 1): EARLY_STOPPING_PATIENCE=3, EVAL_STEPS=10, EVAL_MAX_TOKENS=256
  • Problem: Model killed at step 40. Eval tokens too short β€” model needs 500-700 tokens just for </think>. Eval was scoring incomplete generations. Only 30 steps of runway before early stopping fired.
  • Decision (v2 run 2): PATIENCE=10, EVAL_MAX_TOKENS=2048, EVAL_MAX_SAMPLES=5
  • Consequence: Training ran to step 210, early stopping fired correctly when eval plateaued
  • Lesson: Early stopping parameters must account for the model's generation length requirements

Decision 5: MAX_STEPS vs. NUM_EPOCHS

  • Context: User set NUM_EPOCHS=2, MAX_STEPS=300
  • Clarification: In TRL, MAX_STEPS takes absolute priority. With 300 prompts Γ— 8 generations / (batch_size=4 Γ— grad_accum=2) = 300 steps per epoch. MAX_STEPS=300 = exactly one epoch regardless of NUM_EPOCHS.
  • Decision: Keep MAX_STEPS=300 for one clean epoch; decide on epoch 2 based on eval trajectory
  • Consequence: Early stopping at 210 means the model trained through 70% of the data before plateauing

Decision 6: TRL 0.24.0 Pinning

  • Context: Unsloth requires specific TRL versions. Upgrading TRL broke vllm/torch dependencies.
  • Decision: Pin trl==0.24.0 --no-deps after Unsloth installation
  • Consequence: Stable environment, but locks out newer TRL features (e.g., entropy bonus in GRPOConfig)
  • Workaround for v3: Implement entropy control via custom callback or trainer subclass

4. Training Results

v2 Final Run (Step 210/300, Early Stopped)

| Metric | Value | Assessment |
|---|---|---|
| eval/best_reward_final | 0.125 | +50% from starting 0.083 |
| train/reward | 0.285 | Below SFT calibration (0.38) |
| train/frac_reward_zero_std | 0.0 | ✅ Fixed (was 1.0 in v1) |
| train/clip_ratio | 0.0 (all) | ⚠️ Entropy collapse: policy never hit clip bounds |
| train/kl | 0.004 | Very conservative policy shift |
| train/completion_length | 2048 (= max) | ⚠️ Every completion truncated |
| train/grad_norm | 0.030 | Stable |
| Duration | 14.9 hours | 210 steps × ~4.3 min/step |

Validation Results (5 held-out prompts)

| Sample | Task | Reward | Notes |
|---|---|---|---|
| 1 | Extraction (JSON) | 0.12 | Fields incorrect, output truncated |
| 2 | Insights (categories) | 0.70 | Coherent PT-BR, structured headers |
| 3 | Retention analysis | 0.70 | Step-by-step methodology |
| 4 | Reengagement decision | 0.50 | `<think>` reasoning visible, contextual |
| 5 | Regional comparison | 0.70 | Comparative framework |

Mean validation reward: 0.54 vs SFT calibration baseline of 0.38 → +42% improvement

Bimodal Performance Pattern

  • Strong (0.50–0.70): Open-ended analysis, insights, comparison tasks
  • Weak (0.12): Structured JSON extraction β€” the completion ceiling blocks the model from outputting complete JSON

5. Diagnosed Issues & Root Causes

Issue 1: Entropy Collapse (Critical)

  • Symptom: clip_ratio=0 on all steps, KL=0.004
  • Root cause: Policy entropy drops to near-zero β†’ all 8 rollouts produce identical output β†’ zero advantage β†’ zero gradient (Skywork-OR1, 2505.22312)
  • Fix (v3): Temperature=1.0, add entropy loss coefficient (Ξ±=5e-3), filter zero-advantage groups

Issue 2: Completion Length Ceiling (Critical)

  • Symptom: completion_length=2048 (= max_completion_length) on every step
  • Root cause: GRPO length bias inflates incorrect response length (Dr. GRPO, 2503.20783 Β§3.1). Model can't finish reasoning β†’ gets low rewards β†’ weak signal
  • Fix (v3): Increase max_completion_length to 4096, reduce num_generations 8β†’4 to fit VRAM

Issue 3: Data Scale (Moderate)

  • Symptom: Early stopping at step 210 (70% of epoch), eval plateaued at 0.125
  • Root cause: 300 prompts is below published minimums (1K–600K in literature)
  • Fix (v3): Expand to 1000+ prompts via synthetic generation and data augmentation

6. Lessons Learned

Technical Lessons

  1. Default model configs kill RL training. Qwen3's generation_config.json sets temperature=0.1. This single default destroyed the first full training run. Always override generation parameters explicitly.

  2. Reward function design is the core ML engineering task. Binary rewards → zero signal. Continuous rewards → training works. Multi-component rewards (format + content + quality) → staged learning where format converges first. The reward function IS the product specification.

  3. GRPO needs diversity to learn. The algorithm is fundamentally about comparing different completions. Anything that reduces diversity (low temperature, small group size, few prompts, completion ceiling) directly reduces learning signal.

  4. TRL step calculation is non-obvious. steps = num_prompts × num_generations / (batch_size × grad_accum). Missing the num_generations multiplier gives wrong epoch estimates. MAX_STEPS always overrides NUM_EPOCHS.

  5. Early stopping needs tuning for generative models. Patience must account for eval generation length. Short eval tokens → incomplete outputs → flat eval scores → premature stopping.

  6. Entropy collapse is the GRPO failure mode. Not divergence, not reward hacking: the model collapses to deterministic output. Monitoring clip_ratio and generation entropy is essential (a minimal entropy probe is sketched after this list).
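
A minimal entropy probe for that monitoring, assuming plain PyTorch (a sketch, not the project's actual logging code):

```python
import torch

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Average per-token entropy (in nats) over a batch of generation logits.

    logits: [batch, seq_len, vocab]. A value drifting toward zero is an early
    warning that rollouts are becoming deterministic (entropy collapse).
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # [batch, seq_len]
    return entropy.mean().item()
```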

Business Lessons

  1. Domain data is the moat, not model size. ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. A 7B model beats o3-mini on SQL. 300 Portuguese e-commerce examples already produced a model that outperforms SFT baseline by 42%.

  2. Cost arbitrage is immediate. A self-hosted 3.7B model on a $0.50/hr GPU costs ~$0.001/analysis vs $0.01+ for API calls. Breakeven at ~100 analyses/day.

  3. Privacy is a feature. Self-hosted means customer data never leaves the organization. This matters for LGPD (Brazilian data protection law) compliance.

  4. Portuguese-first is defensible. Most LLM development is English-first. A model that deeply understands Brazilian e-commerce Portuguese ("veio com defeito" / "it came defective", "nota 1 estrela" / "1-star rating") has a real competitive advantage.

  5. Budget 3-5 iterations, not 1. The first run is diagnostic. v1 found the zero-signal bug. v2 found the temperature bug and completion ceiling. v3 will address entropy collapse. Each iteration is cheaper than the last because you know what to measure.


7. Next Steps

See docs/ADR-001-next-steps.md for detailed execution plans.

Priority 1: Build Domain Benchmark (1-2 days)

50-100 held-out prompts, automated scoring, establish baselines

Priority 2: Run Comparison vs. Qwen3-35B-A3B (1 day)

Prove small tuned model matches/beats large general model on domain tasks

Priority 3: GRPO v3 Training Run (2-3 days)

Fix entropy collapse, increase completion length, expand training data


References

| Paper | Key Finding | Relevance |
|---|---|---|
| DeepSeek-R1 (2501.12948) | SFT → GRPO pipeline, rule-based rewards | Architecture template |
| Dr. GRPO (2503.20783) | Remove std normalization, remove length bias | Fixes reward scaling |
| Skywork-OR1 MAGIC (2505.22312) | Entropy collapse diagnosis and fix | Explains clip_ratio=0 |
| MC-GRPO (2601.22582) | Median baseline for small rollout budgets | Fixes G=8 noise |
| ThinkJSON (2502.14905) | 1.5B beats 671B on JSON extraction | Proves domain specialization thesis |
| Reasoning-SQL (2503.23157) | 7B beats o3-mini on SQL with GRPO | Proves GRPO works for SQL |
| Cocktail Effect (2410.01109) | Multi-task SFT + general data boosts domain performance | SFT improvement recipe |