# Tucano2-Commerce: Domain-Specialized LLM for Brazilian E-Commerce Analysis
## Project Status: v2 Complete → v3 Planned
**Model:** Qwen3-3.7B → SFT → GRPO alignment
**Domain:** Brazilian e-commerce (sentiment analysis, churn prediction, SQL generation, structured extraction)
**Infrastructure:** Vertex AI Workbench, NVIDIA L4 (24GB), Unsloth + TRL 0.24.0
**Tracking:** W&B project `tferrazrafael-self/tucano2-commerce`
---
## 1. Problem Statement
Brazilian e-commerce companies need automated analysis of customer reviews, churn prediction, and business intelligence generation, all in Portuguese. General-purpose LLMs (GPT-4o, Claude) are:
1. **Expensive at scale:** API costs of ~$0.01/analysis × thousands of daily reviews
2. **Not domain-optimized:** they miss Brazilian Portuguese idioms and e-commerce-specific patterns
3. **Not self-hosted:** customer data leaves the organization on every API call
**Goal:** Build a compact (3.7B parameter) model that matches or exceeds large general models on e-commerce-specific tasks, runs on a single GPU, and keeps data on-premise.
---
## 2. Context & Approach
### Architecture Decision: SFT + GRPO
The training pipeline follows the DeepSeek-R1 paradigm (arXiv:2501.12948):
```
Qwen3-3.7B (base) → SFT (domain adaptation) → GRPO (alignment via reward signals)
```
**Why Qwen3-3.7B:**
- Strong multilingual base with Portuguese capability
- 3.7B parameters fit in 24GB VRAM with 4-bit quantization (Unsloth NF4)
- Qwen3 architecture includes native `<think>` reasoning mode
**Why GRPO over DPO/PPO:**
- No need for a separate reward model (rule-based rewards suffice for structured tasks)
- Group-relative optimization naturally handles multi-task reward distributions
- Published results show GRPO working well at this model scale (Dr. GRPO, Skywork-OR1)
**Why rule-based rewards:**
- E-commerce tasks have verifiable outputs (JSON schema adherence, SQL execution, sentiment polarity)
- Neural reward models introduce reward hacking at small scale
- DeepSeek-R1 demonstrated rule-based rewards outperform neural reward models for structured tasks
### Task Portfolio
| Task | Input | Output | Evaluation |
|------|-------|--------|------------|
| Structured Extraction | Customer review + metadata | JSON with 10 fields (example below) | Field-level match |
| Sentiment Analysis | Review text | Polarity + score | Accuracy + F1 |
| SQL Generation | Business question | Executable SQL query | Execution accuracy |
| Churn Prediction | Customer profile | Risk score + reasoning | Binary accuracy |
| Business Insights | Open-ended question | Analytical report in PT-BR | LLM-as-judge |
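To make the structured-extraction row above concrete, here is a hypothetical target output; the field names are illustrative, not the project's actual 10-field schema:

```python
# Hypothetical extraction target -- field names are illustrative only;
# the production schema defines 10 fields.
extraction_target = {
    "produto": "fone de ouvido bluetooth",
    "sentimento": "negativo",
    "nota_estimada": 1,
    "problema_relatado": "veio com defeito",
    "pediu_reembolso": True,
}
```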
### Infrastructure
- **Training:** Vertex AI Workbench, single NVIDIA L4 (24GB VRAM)
- **Quantization:** Unsloth NF4 (4-bit) for training, which lets the 3.7B model fit in 24GB (see the loading sketch below)
- **Framework:** TRL 0.24.0 (pinned for Unsloth compatibility), `UnslothGRPOTrainer`
- **Monitoring:** Weights & Biases
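A minimal sketch of the 4-bit load on this stack, using Unsloth's public `FastLanguageModel.from_pretrained` API; the checkpoint path and sequence length are placeholders, not the project's exact values:

```python
from unsloth import FastLanguageModel

# Load the base model with NF4 4-bit quantization so the 3.7B weights
# fit in the L4's 24GB.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-3.7B",  # hypothetical hub path; use the actual base checkpoint
    max_seq_length=4096,           # placeholder
    load_in_4bit=True,             # Unsloth NF4
)
```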
---
## 3. Decision Log
### Decision 1: Continuous vs. Binary Reward Functions
- **Context:** Initial reward functions used binary (0/1) scoring
- **Problem:** 50% of training steps showed `reward_std=0` and `loss=0` → no learning signal
- **Decision:** Rewrote all 4 reward functions with continuous scoring (0.0–1.0), giving partial credit for partially correct outputs (sketched below)
- **Consequence:** Zero-std steps dropped from 50% to ~10%; loss became consistently non-zero
- **Reference:** Dr. GRPO paper (2503.20783) proves std-based normalization amplifies this issue
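A minimal sketch of the continuous-scoring idea for the extraction task; this field-matching rule is one component, and the real reward functions also score format and content quality:

```python
import json

def extraction_reward(completion: str, target: dict) -> float:
    """Continuous reward in [0.0, 1.0]: partial credit per correct field
    instead of all-or-nothing 0/1 scoring."""
    try:
        predicted = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0  # unparseable output gets the floor, not an exception
    if not isinstance(predicted, dict):
        return 0.0
    # Fraction of target fields reproduced exactly.
    matches = sum(predicted.get(k) == v for k, v in target.items())
    return matches / len(target)
```

Because partially correct outputs now land between 0 and 1, two completions in the same group almost always differ in reward, which keeps `reward_std` non-zero.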
### Decision 2: Temperature 0.8 → 1.0
- **Context:** Model's `generation_config.json` had `temperature=0.1` (default from Qwen3)
- **Problem:** All 8 GRPO completions were near-identical → zero reward variance → zero advantage → zero gradient. `frac_reward_zero_std=1.0` on every step; the first full run had to be killed.
- **Decision v2:** Set `temperature=0.8` in GRPOConfig
- **Outcome v2:** Fixed the zero-std catastrophe. Training ran 210 steps, and eval improved 50% (0.083 → 0.125)
- **Decision v3 (planned):** Increase to `temperature=1.0`: all published GRPO papers (DeepSeek-R1, Dr. GRPO, Skywork-OR1) use 1.0, and higher temperature further delays entropy collapse
- **Reference:** Skywork-OR1 (2505.22312) ablation: τ=1.0 gives 5-8% better test performance than τ=0.6
### Decision 3: `scale_rewards=False` (Dr. GRPO)
- **Context:** Standard GRPO normalizes advantages by `std(rewards)` per group
- **Problem:** When one group has low variance, dividing by a small std inflates its gradient contribution → training instability and bias toward "easy" prompts
- **Decision:** Disabled std normalization, following the Dr. GRPO paper (see the config sketch below)
- **Consequence:** More stable training; combined with continuous rewards, eliminated most zero-gradient steps
- **Reference:** Dr. GRPO (2503.20783) achieved SOTA 43.3% on AIME 2024 with a 7B model using this fix
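Both fixes land in the trainer config. A sketch of the relevant `GRPOConfig` fields, assuming both are exposed in this TRL version as the decisions above indicate; batch and length values mirror the run described below and are otherwise illustrative:

```python
from trl import GRPOConfig

config = GRPOConfig(
    temperature=1.0,               # Decision 2 (v3): match published GRPO setups
    scale_rewards=False,           # Decision 3: Dr. GRPO, no std normalization
    num_generations=8,             # group size G for relative advantages
    max_completion_length=2048,    # raised to 4096 in v3 (see Issue 2)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    max_steps=300,
)
```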
### Decision 4: Early Stopping Configuration
- **Context (v2 run 1):** `EARLY_STOPPING_PATIENCE=3`, `EVAL_STEPS=10`, `EVAL_MAX_TOKENS=256`
- **Problem:** Training was killed at step 40. The eval token budget was too short: the model needs 500-700 tokens just to reach `</think>`, so eval scored incomplete generations. With evals every 10 steps and patience 3, early stopping had only 30 steps of runway before it fired.
- **Decision (v2 run 2):** `PATIENCE=10`, `EVAL_MAX_TOKENS=2048`, `EVAL_MAX_SAMPLES=5` (wiring sketched below)
- **Consequence:** Training ran to step 210, early stopping fired correctly when eval plateaued
- **Lesson:** Early stopping parameters must account for the model's generation length requirements
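A sketch of the run-2 wiring, assuming the standard `transformers` `EarlyStoppingCallback`; the metric name is illustrative and must match what the eval loop actually reports:

```python
from transformers import EarlyStoppingCallback

# Stop only after 10 consecutive non-improving evals (evals every 10 steps).
early_stopping = EarlyStoppingCallback(early_stopping_patience=10)

# The trainer args must also set (GRPOConfig inherits TrainingArguments):
#   eval_strategy="steps", eval_steps=10,
#   metric_for_best_model="eval_reward",  # illustrative name
#   greater_is_better=True, load_best_model_at_end=True
```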
### Decision 5: `MAX_STEPS` vs. `NUM_EPOCHS`
- **Context:** User set `NUM_EPOCHS=2`, `MAX_STEPS=300`
- **Clarification:** In TRL, `MAX_STEPS` takes absolute priority. With 300 prompts × 8 generations / (batch_size=4 × grad_accum=2), one epoch is exactly 300 optimizer steps, so `MAX_STEPS=300` equals one epoch regardless of `NUM_EPOCHS` (worked below).
- **Decision:** Keep `MAX_STEPS=300` for one clean epoch; decide on epoch 2 based on eval trajectory
- **Consequence:** Early stopping at 210 means the model trained through 70% of the data before plateauing
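The step arithmetic, worked as a sanity check:

```python
# Optimizer steps per epoch under TRL GRPO:
num_prompts, num_generations = 300, 8
batch_size, grad_accum = 4, 2

steps_per_epoch = num_prompts * num_generations // (batch_size * grad_accum)
assert steps_per_epoch == 300  # MAX_STEPS=300 is therefore exactly one epoch
```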
### Decision 6: TRL 0.24.0 Pinning
- **Context:** Unsloth requires specific TRL versions. Upgrading TRL broke vllm/torch dependencies.
- **Decision:** Pin `trl==0.24.0 --no-deps` after Unsloth installation
- **Consequence:** Stable environment, but locks out newer TRL features (e.g., entropy bonus in GRPOConfig)
- **Workaround for v3:** Implement entropy control via custom callback or trainer subclass
---
## 4. Training Results
### v2 Final Run (Step 210/300, Early Stopped)
| Metric | Value | Assessment |
|--------|-------|------------|
| `eval/best_reward_final` | 0.125 | +50% over the starting 0.083 |
| `train/reward` | 0.285 | Below SFT calibration (0.38) |
| `train/frac_reward_zero_std` | 0.0 | ✅ Fixed (was 1.0 in v1) |
| `train/kl` | 0.004 | Very conservative policy shift |
| `train/clip_ratio` | 0.0 (all) | ⚠️ Entropy collapse: policy never hit clip bounds |
| `train/completion_length` | 2048 (= max) | ⚠️ Every completion truncated |
| `train/grad_norm` | 0.030 | Stable |
| Duration | 14.9 hours | 210 steps × ~4.3 min/step |
### Validation Results (5 held-out prompts)
| Sample | Task | Reward | Notes |
|--------|------|--------|-------|
| 1 | Extraction (JSON) | 0.12 | Fields incorrect, output truncated |
| 2 | Insights (categories) | 0.70 | Coherent PT-BR, structured headers |
| 3 | Retention analysis | 0.70 | Step-by-step methodology |
| 4 | Reengagement decision | 0.50 | `<think>` reasoning visible, contextual |
| 5 | Regional comparison | 0.70 | Comparative framework |
**Mean validation reward: 0.54** vs. the SFT calibration baseline of **0.38**, a **+42% improvement**
### Bimodal Performance Pattern
- **Strong (0.50–0.70):** Open-ended analysis, insights, comparison tasks
- **Weak (0.12):** Structured JSON extraction, where the completion ceiling blocks the model from outputting complete JSON
---
## 5. Diagnosed Issues & Root Causes
### Issue 1: Entropy Collapse (Critical)
- **Symptom:** `clip_ratio=0` on all steps, KL=0.004
- **Root cause:** Policy entropy drops to near-zero → all 8 rollouts produce identical output → zero advantage → zero gradient (Skywork-OR1, 2505.22312)
- **Fix (v3):** Temperature=1.0, add an entropy loss coefficient (α=5e-3), filter zero-advantage groups
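Of these, the group filter is easy to sketch. A minimal version over a per-group reward tensor (the layout is illustrative; wiring it into `UnslothGRPOTrainer` would need a subclass or callback given the TRL pin):

```python
import torch

def keep_informative_groups(rewards: torch.Tensor) -> torch.Tensor:
    """rewards has shape (num_groups, G). Groups whose G rollouts all tie
    contribute zero advantage and zero gradient, so drop them."""
    informative = rewards.std(dim=1) > 0  # per-group reward spread
    return rewards[informative]
```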
### Issue 2: Completion Length Ceiling (Critical)
- **Symptom:** `completion_length=2048` (= `max_completion_length`) on every step
- **Root cause:** GRPO length bias inflates incorrect response length (Dr. GRPO, 2503.20783 §3.1). The model can't finish its reasoning → low rewards → weak signal
- **Fix (v3):** Increase `max_completion_length` to 4096 and reduce `num_generations` 8 → 4 to fit VRAM
### Issue 3: Data Scale (Moderate)
- **Symptom:** Early stopping at step 210 (70% of epoch), eval plateaued at 0.125
- **Root cause:** 300 prompts is below published minimums (1K–600K in the literature)
- **Fix (v3):** Expand to 1000+ prompts via synthetic generation and data augmentation
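A sketch of the template-expansion idea; templates and slot values are toy examples, and the real pipeline would add LLM-generated paraphrases on top:

```python
from itertools import product

# Cross-producting a few dozen templates, categories, and regions scales
# the 300 seed prompts well past the 1,000-prompt target.
templates = [
    "Analise o churn dos clientes de {categoria} em {regiao}.",
    "Compare as avaliações de {categoria} entre {regiao} e outras regiões.",
]
categorias = ["eletrônicos", "moda", "beleza"]
regioes = ["São Paulo", "Nordeste", "Sul"]

prompts = [
    t.format(categoria=c, regiao=r)
    for t, c, r in product(templates, categorias, regioes)
]
print(len(prompts))  # 2 x 3 x 3 = 18 from this toy slice alone
```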
---
## 6. Lessons Learned
### Technical Lessons
1. **Default model configs kill RL training.** Qwen3's `generation_config.json` sets `temperature=0.1`. This single default destroyed the first full training run. Always override generation parameters explicitly.
2. **Reward function design is the core ML engineering task.** Binary rewards → zero signal. Continuous rewards → training works. Multi-component rewards (format + content + quality) → staged learning where format converges first. The reward function IS the product specification.
3. **GRPO needs diversity to learn.** The algorithm is fundamentally about comparing different completions. Anything that reduces diversity (low temperature, small group size, few prompts, completion ceiling) directly reduces learning signal.
4. **TRL step calculation is non-obvious.** `steps = num_prompts × num_generations / (batch_size × grad_accum)`. Missing the `num_generations` multiplier gives wrong epoch estimates. `MAX_STEPS` always overrides `NUM_EPOCHS`.
5. **Early stopping needs tuning for generative models.** Patience must account for eval generation length. Short eval tokens → incomplete outputs → flat eval scores → premature stopping.
6. **Entropy collapse is the GRPO failure mode.** Not divergence, not reward hacking: the model collapses to deterministic output. Monitoring `clip_ratio` and generation entropy is essential (a minimal watcher is sketched below).
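A minimal watcher for lesson 6, built on a standard `transformers` callback; the logged key and window size are illustrative and depend on the TRL version:

```python
from transformers import TrainerCallback

class ClipRatioWatch(TrainerCallback):
    """Flag sustained clip_ratio=0, an early symptom of entropy collapse."""

    def __init__(self, window: int = 20):
        self.window = window       # consecutive zero logs before warning
        self.zero_streak = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is None or "clip_ratio" not in logs:  # key is TRL-version dependent
            return
        self.zero_streak = self.zero_streak + 1 if logs["clip_ratio"] == 0 else 0
        if self.zero_streak >= self.window:
            print(f"[step {state.global_step}] clip_ratio has been 0 for "
                  f"{self.zero_streak} consecutive logs: possible entropy collapse")
```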
### Business Lessons
1. **Domain data is the moat, not model size.** ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. A 7B model beats o3-mini on SQL. 300 Portuguese e-commerce examples already produced a model that outperforms the SFT baseline by 42%.
2. **Cost arbitrage is immediate.** A self-hosted 3.7B model on a $0.50/hr GPU costs ~$0.001/analysis vs $0.01+ for API calls. Breakeven at ~100 analyses/day.
3. **Privacy is a feature.** Self-hosted means customer data never leaves the organization. This matters for LGPD (Brazilian data protection law) compliance.
4. **Portuguese-first is defensible.** Most LLM development is English-first. A model that deeply understands Brazilian e-commerce Portuguese ("veio com defeito", "nota 1 estrela") has a real competitive advantage.
5. **Budget 3-5 iterations, not 1.** The first run is diagnostic. v1 found the zero-signal bug. v2 found the temperature bug and completion ceiling. v3 will address entropy collapse. Each iteration is cheaper than the last because you know what to measure.
---
## 7. Next Steps
See `docs/ADR-001-next-steps.md` for detailed execution plans.
### Priority 1: Build Domain Benchmark (1-2 days)
Build 50-100 held-out prompts with automated scoring, and establish baselines
### Priority 2: Run Comparison vs. Qwen3-35B-A3B (1 day)
Prove the small tuned model matches or beats a large general model on domain tasks
### Priority 3: GRPO v3 Training Run (2-3 days)
Fix entropy collapse, increase completion length, expand training data
---
## References
| Paper | Key Finding | Relevance |
|-------|------------|-----------|
| DeepSeek-R1 (2501.12948) | SFT→GRPO pipeline, rule-based rewards | Architecture template |
| Dr. GRPO (2503.20783) | Remove std normalization, remove length bias | Fixes reward scaling |
| Skywork-OR1 MAGIC (2505.22312) | Entropy collapse diagnosis and fix | Explains clip_ratio=0 |
| MC-GRPO (2601.22582) | Median baseline for small rollout budgets | Fixes G=8 noise |
| ThinkJSON (2502.14905) | 1.5B beats 671B on JSON extraction | Proves domain specialization thesis |
| Reasoning-SQL (2503.23157) | 7B beats o3-mini on SQL with GRPO | Proves GRPO works for SQL |
| Cocktail Effect (2410.01109) | Multi-task SFT + general data boosts domain performance | SFT improvement recipe |