# Tucano2-Commerce: Domain-Specialized LLM for Brazilian E-Commerce Analysis

## Project Status: v2 Complete β€” v3 Planned

**Model:** Qwen3-3.7B β†’ SFT β†’ GRPO alignment  
**Domain:** Brazilian e-commerce (sentiment analysis, churn prediction, SQL generation, structured extraction)  
**Infrastructure:** Vertex AI Workbench, NVIDIA L4 (24GB), Unsloth + TRL 0.24.0  
**Tracking:** W&B project `tferrazrafael-self/tucano2-commerce`

---

## 1. Problem Statement

Brazilian e-commerce companies need automated analysis of customer reviews, churn prediction, and business intelligence generation β€” all in Portuguese. General-purpose LLMs (GPT-4o, Claude) are:

1. **Expensive at scale** β€” API costs of ~$0.01/analysis Γ— thousands of daily reviews
2. **Not domain-optimized** β€” miss Brazilian Portuguese idioms, e-commerce-specific patterns
3. **Not self-hosted** β€” customer data leaves the organization for every API call

**Goal:** Build a compact (3.7B parameter) model that matches or exceeds large general models on e-commerce-specific tasks, runs on a single GPU, and keeps data on-premise.

---

## 2. Context & Approach

### Architecture Decision: SFT + GRPO

The training pipeline follows the DeepSeek-R1 paradigm (arXiv:2501.12948):

```
Qwen3-3.7B (base) β†’ SFT (domain adaptation) β†’ GRPO (alignment via reward signals)
```

**Why Qwen3-3.7B:**
- Strong multilingual base with Portuguese capability
- 3.7B parameters fits in 24GB VRAM with 4-bit quantization (Unsloth NF4)
- Qwen3 architecture includes native `<think>` reasoning mode

**Why GRPO over DPO/PPO:**
- No need for a separate reward model (rule-based rewards suffice for structured tasks)
- Group-relative optimization naturally handles multi-task reward distributions
- Published results show GRPO working well at this model scale (Dr. GRPO, Skywork-OR1)

**Why rule-based rewards:**
- E-commerce tasks have verifiable outputs (JSON schema adherence, SQL execution, sentiment polarity)
- Neural reward models introduce reward hacking at small scale
- DeepSeek-R1 demonstrated rule-based rewards outperform neural reward models for structured tasks

### Task Portfolio

| Task | Input | Output | Evaluation |
|------|-------|--------|------------|
| Structured Extraction | Customer review + metadata | JSON with 10 fields | Field-level match |
| Sentiment Analysis | Review text | Polarity + score | Accuracy + F1 |
| SQL Generation | Business question | Executable SQL query | Execution accuracy |
| Churn Prediction | Customer profile | Risk score + reasoning | Binary accuracy |
| Business Insights | Open-ended question | Analytical report in PT-BR | LLM-as-judge |

### Infrastructure

- **Training:** Vertex AI Workbench, single NVIDIA L4 (24GB VRAM)
- **Quantization:** Unsloth NF4 (4-bit) for training β€” enables 3.7B model to fit in 24GB
- **Framework:** TRL 0.24.0 (pinned for Unsloth compatibility), `UnslothGRPOTrainer`
- **Monitoring:** Weights & Biases

---

## 3. Decision Log

### Decision 1: Continuous vs. Binary Reward Functions
- **Context:** Initial reward functions used binary (0/1) scoring
- **Problem:** 50% of training steps showed `reward_std=0` and `loss=0` β€” no learning signal
- **Decision:** Rewrote all 4 reward functions with continuous scoring (0.0–1.0), partial credit for partially correct outputs
- **Consequence:** Zero-std steps dropped from 50% to ~10%; loss became consistently non-zero
- **Reference:** Dr. GRPO (2503.20783) shows that std-based normalization amplifies this issue
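
The continuous-scoring idea can be sketched for the JSON-extraction reward. This is a minimal illustration, not the project's actual reward code; the field names and the 0.3/0.7 format/content split are assumptions:

```python
import json

def json_extraction_reward(completion: str, expected: dict) -> float:
    """Continuous reward with partial credit, instead of binary 0/1.

    expected: gold field values for the extraction task (illustrative).
    """
    # Format component: is there any parseable JSON object at all?
    try:
        start = completion.index("{")
        end = completion.rindex("}") + 1
        parsed = json.loads(completion[start:end])
    except (ValueError, json.JSONDecodeError):
        return 0.0  # no valid JSON -> no credit
    if not isinstance(parsed, dict):
        return 0.0

    format_score = 0.3  # valid JSON earns credit even with wrong fields

    # Content component: fraction of expected fields reproduced exactly
    if not expected:
        return format_score
    matches = sum(1 for k, v in expected.items() if parsed.get(k) == v)
    return format_score + 0.7 * (matches / len(expected))
```

A half-correct output now earns an intermediate score, so a group of imperfect completions still produces reward variance for GRPO to exploit.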

### Decision 2: Temperature 0.8 β†’ 1.0
- **Context:** Model's `generation_config.json` had `temperature=0.1` (default from Qwen3)
- **Problem:** All 8 GRPO completions were near-identical β†’ zero reward variance β†’ zero advantage β†’ zero gradient. `frac_reward_zero_std=1.0` on every step. First full run was killed.
- **Decision v2:** Set `temperature=0.8` in GRPOConfig
- **Outcome v2:** Fixed the zero-std catastrophe. Training ran 210 steps, eval improved 50% (0.083β†’0.125)
- **Decision v3 (planned):** Increase to `temperature=1.0` β€” all published GRPO papers (DeepSeek-R1, Dr. GRPO, Skywork-OR1) use 1.0. Higher temperature further delays entropy collapse.
- **Reference:** Skywork-OR1 (2505.22312) ablation: Ο„=1.0 gives 5-8% better test performance than Ο„=0.6
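
The sampling override can be stated explicitly in `GRPOConfig` rather than inherited from the model's `generation_config.json`. A fragment, not the project's full config; parameter availability should be checked against the pinned TRL 0.24.0:

```python
from trl import GRPOConfig

# Override Qwen3's generation_config default (temperature=0.1) explicitly;
# v2 used 0.8, v3 plans 1.0 per the published GRPO recipes.
config = GRPOConfig(
    temperature=0.8,
    num_generations=8,
    max_completion_length=2048,
)
```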

### Decision 3: `scale_rewards=False` (Dr. GRPO)
- **Context:** Standard GRPO normalizes advantages by `std(rewards)` per group
- **Problem:** When one group has low variance, dividing by a small std inflates its gradient contribution β†’ training instability and bias toward "easy" prompts
- **Decision:** Disabled std normalization following Dr. GRPO paper
- **Consequence:** More stable training; combined with continuous rewards, eliminated most zero-gradient steps
- **Reference:** Dr. GRPO (2503.20783) achieved SOTA 43.3% on AIME 2024 with a 7B model using this fix

### Decision 4: Early Stopping Configuration
- **Context (v2 run 1):** `EARLY_STOPPING_PATIENCE=3`, `EVAL_STEPS=10`, `EVAL_MAX_TOKENS=256`
- **Problem:** Run was killed at step 40. The eval token budget was too short β€” the model needs 500-700 tokens just to close `</think>`, so eval was scoring incomplete generations. Patience 3 Γ— eval every 10 steps left only 30 steps of runway before early stopping fired.
- **Decision (v2 run 2):** `PATIENCE=10`, `EVAL_MAX_TOKENS=2048`, `EVAL_MAX_SAMPLES=5`
- **Consequence:** Training ran to step 210, early stopping fired correctly when eval plateaued
- **Lesson:** Early stopping parameters must account for the model's generation length requirements
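
A plateau check of this shape (a hypothetical helper, not the project's actual callback) makes the patience/cadence interaction concrete: stopping keys off consecutive evals without improvement, so `PATIENCE=3` with `EVAL_STEPS=10` allows only 30 training steps of flat eval before termination:

```python
def should_stop(eval_rewards: list[float], patience: int = 10) -> bool:
    """Early stopping on the eval reward curve: stop once the best
    value has not improved for `patience` consecutive evals."""
    if len(eval_rewards) <= patience:
        return False
    best_idx = eval_rewards.index(max(eval_rewards))
    return len(eval_rewards) - 1 - best_idx >= patience
```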

### Decision 5: `MAX_STEPS` vs. `NUM_EPOCHS`
- **Context:** The training config set `NUM_EPOCHS=2` and `MAX_STEPS=300`
- **Clarification:** In TRL, `MAX_STEPS` takes absolute priority. With 300 prompts Γ— 8 generations / (batch_size=4 Γ— grad_accum=2) = 300 steps per epoch. `MAX_STEPS=300` = exactly one epoch regardless of `NUM_EPOCHS`.
- **Decision:** Keep `MAX_STEPS=300` for one clean epoch; decide on epoch 2 based on eval trajectory
- **Consequence:** Early stopping at 210 means the model trained through 70% of the data before plateauing
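
The step arithmetic is worth writing down once, since omitting the `num_generations` multiplier gives wrong epoch estimates. A sketch of TRL's effective behavior, not its internal code:

```python
def grpo_steps_per_epoch(num_prompts: int, num_generations: int,
                         batch_size: int, grad_accum: int) -> int:
    """Optimizer steps per epoch under TRL's GRPO: each prompt is
    expanded into num_generations completions before batching."""
    completions = num_prompts * num_generations
    return completions // (batch_size * grad_accum)
```

For this run: 300 prompts Γ— 8 generations / (4 Γ— 2) = 300 steps, i.e. `MAX_STEPS=300` is exactly one epoch.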

### Decision 6: TRL 0.24.0 Pinning
- **Context:** Unsloth requires specific TRL versions. Upgrading TRL broke vllm/torch dependencies.
- **Decision:** Pin `trl==0.24.0 --no-deps` after Unsloth installation
- **Consequence:** Stable environment, but locks out newer TRL features (e.g., entropy bonus in GRPOConfig)
- **Workaround for v3:** Implement entropy control via custom callback or trainer subclass
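
The install order behind the pin, as a sketch (the exact Unsloth install line may differ per CUDA/torch combination):

```shell
# Let Unsloth resolve torch/vllm first, then pin TRL without
# letting pip touch those resolved dependencies.
pip install unsloth
pip install trl==0.24.0 --no-deps
```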

---

## 4. Training Results

### v2 Final Run (Step 210/300, Early Stopped)

| Metric | Value | Assessment |
|--------|-------|------------|
| `eval/best_reward_final` | 0.125 | +50% from starting 0.083 |
| `train/reward` | 0.285 | Below SFT calibration (0.38) |
| `train/frac_reward_zero_std` | 0.0 | βœ… Fixed (was 1.0 in v1) |
| `train/kl` | 0.004 | Very conservative policy shift |
| `train/clip_ratio` | 0.0 (all) | ⚠️ Entropy collapse β€” policy never hit clip bounds |
| `train/completion_length` | 2048 (= max) | ⚠️ Every completion truncated |
| `train/grad_norm` | 0.030 | Stable |
| Duration | 14.9 hours | 210 steps Γ— ~4.3 min/step |

### Validation Results (5 held-out prompts)

| Sample | Task | Reward | Notes |
|--------|------|--------|-------|
| 1 | Extraction (JSON) | 0.12 | Fields incorrect, output truncated |
| 2 | Insights (categories) | 0.70 | Coherent PT-BR, structured headers |
| 3 | Retention analysis | 0.70 | Step-by-step methodology |
| 4 | Reengagement decision | 0.50 | `<think>` reasoning visible, contextual |
| 5 | Regional comparison | 0.70 | Comparative framework |

**Mean validation reward: 0.54** vs SFT calibration baseline of **0.38** β†’ **+42% improvement**

### Bimodal Performance Pattern

- **Strong (0.50–0.70):** Open-ended analysis, insights, comparison tasks
- **Weak (0.12):** Structured JSON extraction β€” the completion ceiling blocks the model from outputting complete JSON

---

## 5. Diagnosed Issues & Root Causes

### Issue 1: Entropy Collapse (Critical)
- **Symptom:** `clip_ratio=0` on all steps, KL=0.004
- **Root cause:** Policy entropy drops to near-zero β†’ all 8 rollouts produce identical output β†’ zero advantage β†’ zero gradient (Skywork-OR1, 2505.22312)
- **Fix (v3):** Temperature=1.0, add entropy loss coefficient (Ξ±=5e-3), filter zero-advantage groups
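
The planned entropy term is a standard bonus subtracted from the policy loss; a minimal sketch using the Ξ±=5e-3 coefficient from the fix (the token distributions below are illustrative):

```python
import math

def entropy_bonus(probs: list[float], alpha: float = 5e-3) -> float:
    """alpha * H(p) for one next-token distribution; subtracting this
    from the policy loss penalizes near-deterministic sampling."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return alpha * h
```

A collapsed policy (one token at probability ~1) earns zero bonus, so the regularizer pushes rollouts back toward the diversity GRPO needs.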

### Issue 2: Completion Length Ceiling (Critical)
- **Symptom:** `completion_length=2048` (= `max_completion_length`) on every step
- **Root cause:** GRPO length bias inflates incorrect response length (Dr. GRPO, 2503.20783 Β§3.1). Model can't finish reasoning β†’ gets low rewards β†’ weak signal
- **Fix (v3):** Increase `max_completion_length` to 4096, reduce `num_generations` 8β†’4 to fit VRAM
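
To first order the v3 swap is VRAM-neutral: the per-prompt completion-token budget (a rough KV-cache proxy that ignores prompt tokens and activation overhead) is unchanged:

```python
def rollout_token_budget(num_generations: int, max_completion_length: int) -> int:
    """Completion tokens held per prompt during generation; a rough
    proxy for KV-cache pressure (prompt tokens not counted)."""
    return num_generations * max_completion_length

# v2: 8 generations x 2048 tokens == v3: 4 generations x 4096 tokens
```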

### Issue 3: Data Scale (Moderate)
- **Symptom:** Early stopping at step 210 (70% of epoch), eval plateaued at 0.125
- **Root cause:** 300 prompts is below published minimums (1K–600K in literature)
- **Fix (v3):** Expand to 1000+ prompts via synthetic generation and data augmentation
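
One way to grow the pool, sketched with hypothetical templates and slot values (the actual augmentation set is not specified here; a real pipeline would also paraphrase and draw entities from production data):

```python
import itertools

# Hypothetical task templates and entity slots for PT-BR e-commerce prompts.
TEMPLATES = [
    "Qual o risco de churn para clientes de {categoria} em {regiao}?",
    "Gere uma consulta SQL: receita de {categoria} na regiΓ£o {regiao}",
]
CATEGORIAS = ["eletrΓ΄nicos", "moda", "casa"]
REGIOES = ["Sudeste", "Nordeste", "Sul"]

def expand_prompts() -> list[str]:
    """Cross-product expansion: len(TEMPLATES) * 3 * 3 = 18 prompts."""
    return [
        tpl.format(categoria=cat, regiao=reg)
        for tpl, cat, reg in itertools.product(TEMPLATES, CATEGORIAS, REGIOES)
    ]
```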

---

## 6. Lessons Learned

### Technical Lessons

1. **Default model configs kill RL training.** Qwen3's `generation_config.json` sets `temperature=0.1`. This single default destroyed the first full training run. Always override generation parameters explicitly.

2. **Reward function design is the core ML engineering task.** Binary rewards β†’ zero signal. Continuous rewards β†’ training works. Multi-component rewards (format + content + quality) β†’ staged learning where format converges first. The reward function IS the product specification.

3. **GRPO needs diversity to learn.** The algorithm is fundamentally about comparing different completions. Anything that reduces diversity (low temperature, small group size, few prompts, completion ceiling) directly reduces learning signal.

4. **TRL step calculation is non-obvious.** `steps = num_prompts Γ— num_generations / (batch_size Γ— grad_accum)`. Missing the `num_generations` multiplier gives wrong epoch estimates. `MAX_STEPS` always overrides `NUM_EPOCHS`.

5. **Early stopping needs tuning for generative models.** Patience must account for eval generation length. Short eval tokens β†’ incomplete outputs β†’ flat eval scores β†’ premature stopping.

6. **Entropy collapse is the GRPO failure mode.** Not divergence, not reward hacking β€” the model collapses to deterministic output. Monitoring `clip_ratio` and generation entropy is essential.

### Business Lessons

1. **Domain data is the moat, not model size.** ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. A 7B model beats o3-mini on SQL. 300 Portuguese e-commerce examples already produced a model that outperforms SFT baseline by 42%.

2. **Cost arbitrage is immediate.** A self-hosted 3.7B model on a $0.50/hr GPU costs ~$0.001/analysis vs $0.01+ for API calls. Breakeven at ~100 analyses/day.

3. **Privacy is a feature.** Self-hosted means customer data never leaves the organization. This matters for LGPD (Brazilian data protection law) compliance.

4. **Portuguese-first is defensible.** Most LLM development is English-first. A model that deeply understands Brazilian e-commerce Portuguese ("veio com defeito", "nota 1 estrela") has a real competitive advantage.

5. **Budget 3-5 iterations, not 1.** The first run is diagnostic. v1 found the zero-signal bug. v2 found the temperature bug and completion ceiling. v3 will address entropy collapse. Each iteration is cheaper than the last because you know what to measure.

---

## 7. Next Steps

See `docs/ADR-001-next-steps.md` for detailed execution plans.

### Priority 1: Build Domain Benchmark (1-2 days)
50-100 held-out prompts, automated scoring, establish baselines

### Priority 2: Run Comparison vs. Qwen3-30B-A3B (1 day)
Prove small tuned model matches/beats large general model on domain tasks

### Priority 3: GRPO v3 Training Run (2-3 days)
Fix entropy collapse, increase completion length, expand training data

---

## References

| Paper | Key Finding | Relevance |
|-------|------------|-----------|
| DeepSeek-R1 (2501.12948) | SFT→GRPO pipeline, rule-based rewards | Architecture template |
| Dr. GRPO (2503.20783) | Remove std normalization, remove length bias | Fixes reward scaling |
| Skywork-OR1 MAGIC (2505.22312) | Entropy collapse diagnosis and fix | Explains clip_ratio=0 |
| MC-GRPO (2601.22582) | Median baseline for small rollout budgets | Fixes G=8 noise |
| ThinkJSON (2502.14905) | 1.5B beats 671B on JSON extraction | Proves domain specialization thesis |
| Reasoning-SQL (2503.23157) | 7B beats o3-mini on SQL with GRPO | Proves GRPO works for SQL |
| Cocktail Effect (2410.01109) | Multi-task SFT + general data boosts domain performance | SFT improvement recipe |