rtferraz committed · verified
Commit aa71b0c · 1 Parent(s): 901bdc7

docs: add project documentation

Files changed (1):
  docs/PROJECT.md  +221 -0
docs/PROJECT.md ADDED
@@ -0,0 +1,221 @@
# Tucano2-Commerce: Domain-Specialized LLM for Brazilian E-Commerce Analysis

## Project Status: v2 Complete – v3 Planned

**Model:** Qwen3-3.7B → SFT → GRPO alignment
**Domain:** Brazilian e-commerce (sentiment analysis, churn prediction, SQL generation, structured extraction)
**Infrastructure:** Vertex AI Workbench, NVIDIA L4 (24GB), Unsloth + TRL 0.24.0
**Tracking:** W&B project `tferrazrafael-self/tucano2-commerce`

---

## 1. Problem Statement

Brazilian e-commerce companies need automated analysis of customer reviews, churn prediction, and business intelligence generation – all in Portuguese. General-purpose LLMs (GPT-4o, Claude) are:

1. **Expensive at scale** – API costs of ~$0.01/analysis × thousands of daily reviews
2. **Not domain-optimized** – they miss Brazilian Portuguese idioms and e-commerce-specific patterns
3. **Not self-hosted** – customer data leaves the organization for every API call

**Goal:** Build a compact (3.7B-parameter) model that matches or exceeds large general models on e-commerce-specific tasks, runs on a single GPU, and keeps data on-premise.

---

## 2. Context & Approach

### Architecture Decision: SFT + GRPO

The training pipeline follows the DeepSeek-R1 paradigm (arXiv:2501.12948):

```
Qwen3-3.7B (base) → SFT (domain adaptation) → GRPO (alignment via reward signals)
```

**Why Qwen3-3.7B:**
- Strong multilingual base with Portuguese capability
- 3.7B parameters fit in 24GB VRAM with 4-bit quantization (Unsloth NF4)
- Qwen3 architecture includes a native `<think>` reasoning mode

**Why GRPO over DPO/PPO:**
- No need for a separate reward model (rule-based rewards suffice for structured tasks)
- Group-relative optimization naturally handles multi-task reward distributions
- Published results show GRPO working well at this model scale (Dr. GRPO, Skywork-OR1)

**Why rule-based rewards:**
- E-commerce tasks have verifiable outputs (JSON schema adherence, SQL execution, sentiment polarity)
- Neural reward models invite reward hacking at small scale
- DeepSeek-R1 demonstrated that rule-based rewards outperform neural reward models for structured tasks

### Task Portfolio

| Task | Input | Output | Evaluation |
|------|-------|--------|------------|
| Structured Extraction | Customer review + metadata | JSON with 10 fields | Field-level match |
| Sentiment Analysis | Review text | Polarity + score | Accuracy + F1 |
| SQL Generation | Business question | Executable SQL query | Execution accuracy |
| Churn Prediction | Customer profile | Risk score + reasoning | Binary accuracy |
| Business Insights | Open-ended question | Analytical report in PT-BR | LLM-as-judge |

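Execution accuracy for the SQL task can be checked fully programmatically. The sketch below is illustrative only (the `run_query` helper, the SQLite backend, and the binary scoring are assumptions, not the project's evaluation code): it runs the gold and generated queries against the same database and compares result sets.

```python
import sqlite3


def run_query(db_path: str, sql: str):
    """Run a query and return its rows as a set of tuples, or None if the SQL fails."""
    with sqlite3.connect(db_path) as conn:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            return None  # invalid SQL gets no credit
    return set(map(tuple, rows))


def sql_execution_accuracy(db_path: str, generated_sql: str, gold_sql: str) -> float:
    """1.0 if the generated query returns exactly the gold result set, else 0.0."""
    gold = run_query(db_path, gold_sql)
    pred = run_query(db_path, generated_sql)
    return 1.0 if pred is not None and pred == gold else 0.0
```

Using sets makes the comparison order-insensitive; if duplicate rows matter, a multiset comparison would be needed instead.
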
### Infrastructure

- **Training:** Vertex AI Workbench, single NVIDIA L4 (24GB VRAM)
- **Quantization:** Unsloth NF4 (4-bit) for training – enables the 3.7B model to fit in 24GB
- **Framework:** TRL 0.24.0 (pinned for Unsloth compatibility), `UnslothGRPOTrainer`
- **Monitoring:** Weights & Biases

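For reference, an Unsloth 4-bit load for this kind of setup looks roughly like the sketch below. The checkpoint name, sequence length, and LoRA rank are placeholders (the project's exact values are not recorded here), so treat it as an assumption-laden sketch rather than the actual training script.

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit NF4 so it fits on a single 24GB L4.
# "Qwen/Qwen3-4B" is a placeholder; substitute the project's actual base checkpoint.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",
    max_seq_length=4096,        # prompt + completion budget (assumed)
    load_in_4bit=True,          # NF4 quantization
)

# Attach LoRA adapters; only these parameters are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # LoRA rank (assumed)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```
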
---

## 3. Decision Log

### Decision 1: Continuous vs. Binary Reward Functions
- **Context:** Initial reward functions used binary (0/1) scoring
- **Problem:** 50% of training steps showed `reward_std=0` and `loss=0` – no learning signal
- **Decision:** Rewrote all 4 reward functions with continuous scoring (0.0–1.0), giving partial credit for partially correct outputs (see the sketch after this decision)
- **Consequence:** Zero-std steps dropped from 50% to ~10%; loss became consistently non-zero
- **Reference:** Dr. GRPO (2503.20783) shows that std-based normalization amplifies this issue

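To make the binary-vs-continuous distinction concrete, here is a minimal sketch of a continuous extraction reward with field-level partial credit. The field names and weighting are hypothetical, not the project's actual reward functions.

```python
import json

# Hypothetical schema; the real extraction task uses 10 fields.
EXPECTED_FIELDS = ["produto", "sentimento", "nota", "problema_relatado", "recomendaria"]


def extraction_reward(completion: str) -> float:
    """Continuous reward in [0, 1]: parseability plus per-field credit.

    A binary version returns 1.0 only for a perfect output and 0.0 otherwise,
    which makes most GRPO groups all-zero and kills the gradient.
    """
    try:
        # Score the last JSON object in the completion (after any <think> block).
        start = completion.rindex("{")
        payload = json.loads(completion[start:])
    except (ValueError, json.JSONDecodeError):
        return 0.0                                      # not parseable: no credit

    score = 0.2                                         # partial credit for valid JSON
    present = sum(1 for f in EXPECTED_FIELDS if f in payload)
    score += 0.8 * present / len(EXPECTED_FIELDS)       # per-field credit
    return min(score, 1.0)
```

In TRL, GRPO reward functions receive a batch of `completions` and return a list of floats, so a per-completion scorer like this would be wrapped in a small batching adapter.
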
### Decision 2: Temperature 0.8 → 1.0
- **Context:** The model's `generation_config.json` had `temperature=0.1` (default from Qwen3)
- **Problem:** All 8 GRPO completions were near-identical → zero reward variance → zero advantage → zero gradient. `frac_reward_zero_std=1.0` on every step. The first full run was killed.
- **Decision v2:** Set `temperature=0.8` in GRPOConfig
- **Outcome v2:** Fixed the zero-std catastrophe. Training ran 210 steps, eval improved 50% (0.083 → 0.125)
- **Decision v3 (planned):** Increase to `temperature=1.0` – the GRPO papers cited here (DeepSeek-R1, Dr. GRPO, Skywork-OR1) all use 1.0, and higher temperature further delays entropy collapse
- **Reference:** Skywork-OR1 (2505.22312) ablation: τ=1.0 gives 5-8% better test performance than τ=0.6

### Decision 3: `scale_rewards=False` (Dr. GRPO)
- **Context:** Standard GRPO normalizes advantages by `std(rewards)` per group
- **Problem:** When one group has low variance, dividing by a small std inflates its gradient contribution → training instability and bias toward "easy" prompts
- **Decision:** Disabled std normalization, following the Dr. GRPO paper (see the config sketch after this decision)
- **Consequence:** More stable training; combined with continuous rewards, this eliminated most zero-gradient steps
- **Reference:** Dr. GRPO (2503.20783) reports 43.3% on AIME 2024 with a 7B model using this fix

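Decisions 2 and 3 map directly onto GRPOConfig fields. Below is a sketch of the intended v3 configuration using values stated in this log; the output path and any field not mentioned in the log are assumptions, not the project's actual config.

```python
from trl import GRPOConfig

# Sketch of the GRPO settings discussed in Decisions 2-3 (v3 values where planned).
# Batch size, accumulation, and step count follow Decision 5 below.
grpo_config = GRPOConfig(
    output_dir="outputs/tucano2-grpo-v3",   # assumed path
    temperature=1.0,                # Decision 2 (v3): match published GRPO setups
    scale_rewards=False,            # Decision 3: no per-group std normalization (Dr. GRPO)
    num_generations=8,              # group size G (v3 may reduce to 4, see Issue 2)
    max_completion_length=2048,     # v2 ceiling; v3 plans 4096 (see Issue 2)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    max_steps=300,                  # overrides num_train_epochs in TRL (Decision 5)
    report_to="wandb",
)
```

Setting `temperature` explicitly here, rather than inheriting it from the model's `generation_config.json`, is exactly what the Decision 2 failure calls for.
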
### Decision 4: Early Stopping Configuration
- **Context (v2 run 1):** `EARLY_STOPPING_PATIENCE=3`, `EVAL_STEPS=10`, `EVAL_MAX_TOKENS=256`
- **Problem:** Run killed at step 40. The eval token budget was too short – the model needs 500-700 tokens just to close `</think>`, so eval scored incomplete generations, and there were only 30 steps of runway before early stopping fired.
- **Decision (v2 run 2):** `PATIENCE=10`, `EVAL_MAX_TOKENS=2048`, `EVAL_MAX_SAMPLES=5`
- **Consequence:** Training ran to step 210; early stopping fired correctly when eval plateaued
- **Lesson:** Early stopping parameters must account for the model's generation-length requirements

### Decision 5: `MAX_STEPS` vs. `NUM_EPOCHS`
- **Context:** The config set `NUM_EPOCHS=2` and `MAX_STEPS=300`
- **Clarification:** In TRL, `MAX_STEPS` takes absolute priority. With 300 prompts × 8 generations / (batch_size=4 × grad_accum=2) = 300 steps per epoch, `MAX_STEPS=300` is exactly one epoch regardless of `NUM_EPOCHS` (see the worked calculation after this decision).
- **Decision:** Keep `MAX_STEPS=300` for one clean epoch; decide on epoch 2 based on the eval trajectory
- **Consequence:** Early stopping at step 210 means the model trained through 70% of the data before plateauing

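The step arithmetic from Decision 5, written out with the numbers from this log (the formula is the one used above, which folds `num_generations` into the effective batch):

```python
# Each prompt contributes num_generations completions; each optimizer step
# consumes per_device_batch_size * gradient_accumulation_steps completions.
num_prompts = 300
num_generations = 8
per_device_batch_size = 4
grad_accum = 2

completions_per_epoch = num_prompts * num_generations            # 2400
completions_per_step = per_device_batch_size * grad_accum        # 8
steps_per_epoch = completions_per_epoch // completions_per_step  # 300

print(steps_per_epoch)  # 300 -> MAX_STEPS=300 is exactly one epoch
```
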
### Decision 6: TRL 0.24.0 Pinning
- **Context:** Unsloth requires specific TRL versions. Upgrading TRL broke vllm/torch dependencies.
- **Decision:** Pin `trl==0.24.0 --no-deps` after Unsloth installation
- **Consequence:** Stable environment, but locks out newer TRL features (e.g., entropy bonus in GRPOConfig)
- **Workaround for v3:** Implement entropy control via custom callback or trainer subclass

---

## 4. Training Results

### v2 Final Run (Step 210/300, Early Stopped)

| Metric | Value | Assessment |
|--------|-------|------------|
| `eval/best_reward_final` | 0.125 | +50% from starting 0.083 |
| `train/reward` | 0.285 | Below SFT calibration (0.38) |
| `train/frac_reward_zero_std` | 0.0 | ✅ Fixed (was 1.0 in v1) |
| `train/kl` | 0.004 | Very conservative policy shift |
| `train/clip_ratio` | 0.0 (all) | ⚠️ Entropy collapse – policy never hit clip bounds |
| `train/completion_length` | 2048 (= max) | ⚠️ Every completion truncated |
| `train/grad_norm` | 0.030 | Stable |
| Duration | 14.9 hours | 210 steps × ~4.3 min/step |

### Validation Results (5 held-out prompts)

| Sample | Task | Reward | Notes |
|--------|------|--------|-------|
| 1 | Extraction (JSON) | 0.12 | Fields incorrect, output truncated |
| 2 | Insights (categories) | 0.70 | Coherent PT-BR, structured headers |
| 3 | Retention analysis | 0.70 | Step-by-step methodology |
| 4 | Reengagement decision | 0.50 | `<think>` reasoning visible, contextual |
| 5 | Regional comparison | 0.70 | Comparative framework |

**Mean validation reward: 0.54** vs SFT calibration baseline of **0.38** → **+42% improvement**

### Bimodal Performance Pattern

- **Strong (0.50–0.70):** Open-ended analysis, insights, comparison tasks
- **Weak (0.12):** Structured JSON extraction – the completion ceiling blocks the model from outputting complete JSON

---

## 5. Diagnosed Issues & Root Causes

### Issue 1: Entropy Collapse (Critical)
- **Symptom:** `clip_ratio=0` on all steps, KL=0.004
- **Root cause:** Policy entropy drops to near-zero → all 8 rollouts produce identical output → zero advantage → zero gradient (Skywork-OR1, 2505.22312)
- **Fix (v3):** Temperature=1.0, add an entropy loss coefficient (α=5e-3), filter zero-advantage groups

### Issue 2: Completion Length Ceiling (Critical)
- **Symptom:** `completion_length=2048` (= `max_completion_length`) on every step
- **Root cause:** GRPO length bias inflates the length of incorrect responses (Dr. GRPO, 2503.20783 §3.1). The model can't finish its reasoning → low rewards → weak signal
- **Fix (v3):** Increase `max_completion_length` to 4096; reduce `num_generations` from 8 to 4 to fit VRAM

### Issue 3: Data Scale (Moderate)
- **Symptom:** Early stopping at step 210 (70% of an epoch), eval plateaued at 0.125
- **Root cause:** 300 prompts is well below the prompt counts used in published GRPO work (1K–600K)
- **Fix (v3):** Expand to 1000+ prompts via synthetic generation and data augmentation

---

## 6. Lessons Learned

### Technical Lessons

1. **Default model configs kill RL training.** Qwen3's `generation_config.json` sets `temperature=0.1`. This single default destroyed the first full training run. Always override generation parameters explicitly.

2. **Reward function design is the core ML engineering task.** Binary rewards → zero signal. Continuous rewards → training works. Multi-component rewards (format + content + quality) → staged learning where format converges first. The reward function IS the product specification.

3. **GRPO needs diversity to learn.** The algorithm is fundamentally about comparing different completions. Anything that reduces diversity (low temperature, small group size, few prompts, the completion ceiling) directly reduces the learning signal.

4. **TRL step calculation is non-obvious.** `steps = num_prompts × num_generations / (batch_size × grad_accum)`. Missing the `num_generations` multiplier gives wrong epoch estimates. `MAX_STEPS` always overrides `NUM_EPOCHS`.

5. **Early stopping needs tuning for generative models.** Patience must account for eval generation length. Short eval tokens → incomplete outputs → flat eval scores → premature stopping.

6. **Entropy collapse is the GRPO failure mode.** Not divergence, not reward hacking – the model collapses to deterministic output. Monitoring `clip_ratio` and generation entropy is essential (a minimal monitoring hook is sketched after this list).

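One lightweight way to automate that monitoring is a `transformers` `TrainerCallback` that watches the logged metrics and flags collapse. This is a sketch, not the project's code; the log key names (`clip_ratio`, `frac_reward_zero_std`) are assumed to match what the trainer emits, so verify them against the W&B run.

```python
from transformers import TrainerCallback


class EntropyCollapseMonitor(TrainerCallback):
    """Warn (or stop) when GRPO logs look like entropy collapse."""

    def __init__(self, patience: int = 20):
        self.patience = patience
        self.flat_steps = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs:
            return control
        clip_ratio = logs.get("clip_ratio")            # assumed key name
        zero_std = logs.get("frac_reward_zero_std")    # assumed key name
        collapsed = (clip_ratio == 0.0) or (zero_std == 1.0)
        self.flat_steps = self.flat_steps + 1 if collapsed else 0
        if self.flat_steps >= self.patience:
            print(f"[EntropyCollapseMonitor] {self.flat_steps} consecutive "
                  "collapsed-looking log events; consider stopping the run.")
            control.should_training_stop = True        # or just warn
        return control
```

The callback would be attached with `trainer.add_callback(EntropyCollapseMonitor())` before training starts.
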
### Business Lessons

1. **Domain data is the moat, not model size.** ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. A 7B model beats o3-mini on SQL. 300 Portuguese e-commerce examples already produced a model that outperforms the SFT baseline by 42%.

2. **Cost arbitrage is immediate.** A self-hosted 3.7B model on a $0.50/hr GPU costs ~$0.001/analysis vs $0.01+ for API calls. Breakeven at ~100 analyses/day.

3. **Privacy is a feature.** Self-hosted means customer data never leaves the organization. This matters for LGPD (Brazilian data protection law) compliance.

4. **Portuguese-first is defensible.** Most LLM development is English-first. A model that deeply understands Brazilian e-commerce Portuguese ("veio com defeito" / "arrived defective", "nota 1 estrela" / "1-star rating") has a real competitive advantage.

5. **Budget 3-5 iterations, not 1.** The first run is diagnostic. v1 found the zero-signal bug. v2 found the temperature bug and the completion ceiling. v3 will address entropy collapse. Each iteration is cheaper than the last because you know what to measure.

---

## 7. Next Steps

See `docs/ADR-001-next-steps.md` for detailed execution plans.

### Priority 1: Build Domain Benchmark (1-2 days)
50-100 held-out prompts, automated scoring, and baseline numbers to compare against (a scoring-loop sketch follows).

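A possible shape for that benchmark harness, sketched under assumptions: a JSONL file of held-out prompts with a `task` field, one scoring function per task (the training reward functions can be reused), and a `generate` callable supplied by whatever serving stack is used. The file schema and function names are hypothetical.

```python
import json
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict

# One rule-based scorer per task; reuse the training reward functions here.
SCORERS: Dict[str, Callable[[str, dict], float]] = {
    # "extraction": lambda completion, example: extraction_reward(completion),
    # "sql": lambda completion, example: sql_execution_accuracy(...),
}


def run_benchmark(prompts_path: str, generate: Callable[[str], str]) -> Dict[str, float]:
    """Score every held-out prompt and report the mean reward per task."""
    per_task = defaultdict(list)
    with open(prompts_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)      # expects {"task": ..., "prompt": ...}
            completion = generate(example["prompt"])
            score = SCORERS[example["task"]](completion, example)
            per_task[example["task"]].append(score)
    return {task: mean(scores) for task, scores in per_task.items()}
```
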
### Priority 2: Run Comparison vs. Qwen3-35B-A3B (1 day)
Show that the small tuned model matches or beats a large general model on domain tasks.

### Priority 3: GRPO v3 Training Run (2-3 days)
Fix entropy collapse, increase the completion length, and expand the training data.

---

## References

| Paper | Key Finding | Relevance |
|-------|------------|-----------|
| DeepSeek-R1 (2501.12948) | SFT → GRPO pipeline, rule-based rewards | Architecture template |
| Dr. GRPO (2503.20783) | Remove std normalization, remove length bias | Fixes reward scaling |
| Skywork-OR1 MAGIC (2505.22312) | Entropy collapse diagnosis and fix | Explains clip_ratio=0 |
| MC-GRPO (2601.22582) | Median baseline for small rollout budgets | Fixes G=8 noise |
| ThinkJSON (2502.14905) | 1.5B beats 671B on JSON extraction | Supports the domain-specialization thesis |
| Reasoning-SQL (2503.23157) | 7B beats o3-mini on SQL with GRPO | Shows GRPO works for SQL |
| Cocktail Effect (2410.01109) | Multi-task SFT + general data boosts domain performance | SFT improvement recipe |