docs: add project documentation

docs/PROJECT.md (ADDED, +221 -0)
# Tucano2-Commerce: Domain-Specialized LLM for Brazilian E-Commerce Analysis

## Project Status: v2 Complete → v3 Planned

**Model:** Qwen3-3.7B → SFT → GRPO alignment
**Domain:** Brazilian e-commerce (sentiment analysis, churn prediction, SQL generation, structured extraction)
**Infrastructure:** Vertex AI Workbench, NVIDIA L4 (24 GB), Unsloth + TRL 0.24.0
**Tracking:** W&B project `tferrazrafael-self/tucano2-commerce`

---

## 1. Problem Statement

Brazilian e-commerce companies need automated analysis of customer reviews, churn prediction, and business-intelligence generation, all in Portuguese. General-purpose LLMs (GPT-4o, Claude) are:

1. **Expensive at scale** – API costs of ~$0.01/analysis × thousands of daily reviews
2. **Not domain-optimized** – they miss Brazilian Portuguese idioms and e-commerce-specific patterns
3. **Not self-hosted** – customer data leaves the organization on every API call

**Goal:** Build a compact (3.7B-parameter) model that matches or exceeds large general models on e-commerce-specific tasks, runs on a single GPU, and keeps data on-premise.

---

## 2. Context & Approach

### Architecture Decision: SFT + GRPO

The training pipeline follows the DeepSeek-R1 paradigm (arXiv:2501.12948):

```
Qwen3-3.7B (base) → SFT (domain adaptation) → GRPO (alignment via reward signals)
```

**Why Qwen3-3.7B:**
- Strong multilingual base with Portuguese capability
- 3.7B parameters fit in 24 GB of VRAM with 4-bit quantization (Unsloth NF4)
- The Qwen3 architecture includes a native `<think>` reasoning mode

**Why GRPO over DPO/PPO:**
- No separate reward model needed (rule-based rewards suffice for structured tasks)
- Group-relative optimization naturally handles multi-task reward distributions
- Published results show GRPO working well at this model scale (Dr. GRPO, Skywork-OR1)

**Why rule-based rewards:**
- E-commerce tasks have verifiable outputs (JSON schema adherence, SQL execution, sentiment polarity)
- Neural reward models invite reward hacking at small scale
- DeepSeek-R1 demonstrated that rule-based rewards outperform neural reward models for structured tasks

### Task Portfolio

| Task | Input | Output | Evaluation |
|------|-------|--------|------------|
| Structured Extraction | Customer review + metadata | JSON with 10 fields | Field-level match |
| Sentiment Analysis | Review text | Polarity + score | Accuracy + F1 |
| SQL Generation | Business question | Executable SQL query | Execution accuracy |
| Churn Prediction | Customer profile | Risk score + reasoning | Binary accuracy |
| Business Insights | Open-ended question | Analytical report in PT-BR | LLM-as-judge |
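
The "execution accuracy" entry deserves a concrete illustration: run the gold query and the generated query against a database snapshot and compare result sets. The checker below is a minimal sketch of that idea using SQLite, not the project's actual harness; `db_path` and the order-insensitive comparison policy are assumptions.

```python
import sqlite3

def execution_accuracy(generated_sql: str, gold_sql: str, db_path: str) -> float:
    """Return 1.0 if the generated query yields the same rows as the gold query
    (order-insensitive), 0.0 if it errors out or the results differ."""
    conn = sqlite3.connect(db_path)
    try:
        gold = sorted(conn.execute(gold_sql).fetchall())
        try:
            pred = sorted(conn.execute(generated_sql).fetchall())
        except sqlite3.Error:
            return 0.0  # invalid SQL scores zero
        return 1.0 if pred == gold else 0.0
    finally:
        conn.close()
```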

### Infrastructure

- **Training:** Vertex AI Workbench, single NVIDIA L4 (24 GB VRAM)
- **Quantization:** Unsloth NF4 (4-bit) for training – lets the 3.7B model fit in 24 GB (see the load sketch below)
- **Framework:** TRL 0.24.0 (pinned for Unsloth compatibility), `UnslothGRPOTrainer`
- **Monitoring:** Weights & Biases
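
For reference, a minimal sketch of the 4-bit setup described above, using the standard Unsloth entry points; the checkpoint name and LoRA hyperparameters are illustrative placeholders, not the project's actual values.

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit NF4 so it fits a 24 GB L4 (placeholder checkpoint name).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",  # placeholder for the project's base checkpoint
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these weights are updated during SFT/GRPO.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```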

---

## 3. Decision Log

### Decision 1: Continuous vs. Binary Reward Functions
- **Context:** Initial reward functions used binary (0/1) scoring
- **Problem:** 50% of training steps showed `reward_std=0` and `loss=0` → no learning signal
- **Decision:** Rewrote all 4 reward functions with continuous scoring (0.0–1.0), giving partial credit for partially correct outputs (sketched below)
- **Consequence:** Zero-std steps dropped from 50% to ~10%; loss became consistently non-zero
- **Reference:** The Dr. GRPO paper (2503.20783) shows that std-based normalization amplifies this issue
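
A minimal sketch of what "continuous scoring with partial credit" means for the extraction task; the weights and the exact field-matching policy here are illustrative, not the project's actual reward code.

```python
import json

def extraction_reward(completion: str, expected: dict) -> float:
    """Continuous reward in [0, 1] for JSON extraction: valid format earns 0.3,
    and the remaining 0.7 is split across correctly filled fields.
    (Illustrative weights; the project's 4 reward functions are analogous.)"""
    try:
        parsed = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0                      # unparseable output earns nothing
    reward = 0.3                        # partial credit just for valid JSON
    matches = sum(parsed.get(k) == v for k, v in expected.items())
    return reward + 0.7 * matches / len(expected)
```

Under binary scoring, a group of eight completions often ties at 0.0, so the group's reward std is zero and every advantage vanishes; graded scoring keeps within-group variance, and therefore the gradient, alive.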

### Decision 2: Temperature 0.8 → 1.0
- **Context:** The model's `generation_config.json` shipped with `temperature=0.1` (the Qwen3 default)
- **Problem:** All 8 GRPO completions were near-identical → zero reward variance → zero advantage → zero gradient. `frac_reward_zero_std=1.0` on every step. The first full run was killed.
- **Decision v2:** Set `temperature=0.8` in GRPOConfig
- **Outcome v2:** Fixed the zero-std catastrophe. Training ran 210 steps; eval improved 50% (0.083 → 0.125)
- **Decision v3 (planned):** Increase to `temperature=1.0` – the published GRPO papers (DeepSeek-R1, Dr. GRPO, Skywork-OR1) all use 1.0, and higher temperature further delays entropy collapse
- **Reference:** Skywork-OR1 (2505.22312) ablation: τ=1.0 gives 5-8% better test performance than τ=0.6

### Decision 3: `scale_rewards=False` (Dr. GRPO)
- **Context:** Standard GRPO normalizes advantages by `std(rewards)` per group
- **Problem:** When one group has low variance, dividing by a small std inflates its gradient contribution → training instability and a bias toward "easy" prompts
- **Decision:** Disabled std normalization, following the Dr. GRPO paper (see the config sketch below)
- **Consequence:** More stable training; combined with continuous rewards, this eliminated most zero-gradient steps
- **Reference:** Dr. GRPO (2503.20783) reached a SOTA 43.3% on AIME 2024 with a 7B model using this fix
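
Decisions 2 and 3 translate into a handful of `GRPOConfig` fields. A sketch of the v2 settings, assuming the standard TRL field names; the batch shape matches the numbers quoted in this log, while the output path is a placeholder.

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs/grpo-v2",      # placeholder path
    temperature=0.8,                   # v2 value; v3 plans 1.0 (Decision 2)
    scale_rewards=False,               # Dr. GRPO: no per-group std division (Decision 3)
    num_generations=8,                 # group size G; v3 plans 4 (see Issue 2)
    max_completion_length=2048,        # v2 ceiling; v3 plans 4096 (see Issue 2)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    max_steps=300,
)
```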

### Decision 4: Early Stopping Configuration
- **Context (v2 run 1):** `EARLY_STOPPING_PATIENCE=3`, `EVAL_STEPS=10`, `EVAL_MAX_TOKENS=256`
- **Problem:** The run was killed at step 40. The eval token budget was far too short – the model needs 500-700 tokens just to close `</think>`, so eval was scoring incomplete generations. That left only 30 steps of runway before early stopping fired.
- **Decision (v2 run 2):** `PATIENCE=10`, `EVAL_MAX_TOKENS=2048`, `EVAL_MAX_SAMPLES=5` (see the callback sketch below)
- **Consequence:** Training ran to step 210, and early stopping fired correctly when eval plateaued
- **Lesson:** Early stopping parameters must account for the model's generation-length requirements
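
One way to express the run-2 settings with the stock `transformers` callback; a sketch only, since the metric name and the exact wiring into `UnslothGRPOTrainer` are assumptions.

```python
from transformers import EarlyStoppingCallback

# Patience counts *evaluations*, not steps: with EVAL_STEPS=10, a patience of 10
# gives 100 training steps of runway after the last eval improvement.
early_stop = EarlyStoppingCallback(
    early_stopping_patience=10,
    early_stopping_threshold=0.0,  # any improvement in the tracked metric resets the counter
)
# The training config must also set metric_for_best_model (e.g. an eval-reward
# key) and load_best_model_at_end=True for this callback to track improvements.
```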

### Decision 5: `MAX_STEPS` vs. `NUM_EPOCHS`
- **Context:** The run was configured with `NUM_EPOCHS=2`, `MAX_STEPS=300`
- **Clarification:** In TRL, `MAX_STEPS` takes absolute priority. With 300 prompts × 8 generations / (batch_size=4 × grad_accum=2) = 300 steps per epoch, `MAX_STEPS=300` is exactly one epoch regardless of `NUM_EPOCHS` (worked out in code below).
- **Decision:** Keep `MAX_STEPS=300` for one clean epoch; decide on epoch 2 based on the eval trajectory
- **Consequence:** Early stopping at step 210 means the model trained through 70% of the data before plateauing
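
The step arithmetic is easy to get wrong because the `num_generations` multiplier never appears explicitly in the config; worked out with this run's numbers:

```python
# Each prompt yields num_generations completions; each optimizer step consumes
# batch_size * grad_accum of them.
num_prompts, num_generations = 300, 8   # 2,400 completions per epoch
batch_size, grad_accum = 4, 2           # 8 completions per optimizer step
steps_per_epoch = num_prompts * num_generations // (batch_size * grad_accum)
assert steps_per_epoch == 300           # so MAX_STEPS=300 is exactly 1 epoch
```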

### Decision 6: TRL 0.24.0 Pinning
- **Context:** Unsloth requires specific TRL versions; upgrading TRL broke the vllm/torch dependency chain
- **Decision:** Pin `trl==0.24.0 --no-deps` after the Unsloth installation
- **Consequence:** Stable environment, but it locks out newer TRL features (e.g., the entropy bonus in GRPOConfig)
- **Workaround for v3:** Implement entropy control via a custom callback or trainer subclass, as sketched below
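
Since the pinned TRL exposes no entropy bonus, one shape the v3 workaround could take is a watchdog callback that halts a run once the collapse signature shows up in the logs. A sketch, assuming the trainer logs a `clip_ratio` key; the patience value is arbitrary.

```python
from transformers import TrainerCallback

class EntropyCollapseWatchdog(TrainerCallback):
    """Stop training after `patience` consecutive logs with clip_ratio == 0,
    the entropy-collapse signature observed in v2. The log key and patience
    are assumptions; adjust to whatever the trainer actually emits."""

    def __init__(self, patience: int = 20, key: str = "clip_ratio"):
        self.patience, self.key, self.streak = patience, key, 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is None or self.key not in logs:
            return
        self.streak = self.streak + 1 if logs[self.key] == 0.0 else 0
        if self.streak >= self.patience:
            control.should_training_stop = True  # halt before wasting GPU hours
```

An actual entropy *bonus* (the α=5e-3 term planned for v3) would need a trainer subclass that adds the term to the loss; the callback route can only detect collapse and stop.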

---

## 4. Training Results

### v2 Final Run (Step 210/300, Early Stopped)

| Metric | Value | Assessment |
|--------|-------|------------|
| `eval/best_reward_final` | 0.125 | +50% over the 0.083 starting point |
| `train/reward` | 0.285 | Below the SFT calibration value (0.38) |
| `train/frac_reward_zero_std` | 0.0 | ✅ Fixed (was 1.0 in v1) |
| `train/kl` | 0.004 | Very conservative policy shift |
| `train/clip_ratio` | 0.0 (all steps) | ⚠️ Entropy collapse – the policy never hit the clip bounds |
| `train/completion_length` | 2048 (= max) | ⚠️ Every completion truncated |
| `train/grad_norm` | 0.030 | Stable |
| Duration | 14.9 hours | 210 steps × ~4.3 min/step |

### Validation Results (5 held-out prompts)

| Sample | Task | Reward | Notes |
|--------|------|--------|-------|
| 1 | Extraction (JSON) | 0.12 | Fields incorrect, output truncated |
| 2 | Insights (categories) | 0.70 | Coherent PT-BR, structured headers |
| 3 | Retention analysis | 0.70 | Step-by-step methodology |
| 4 | Reengagement decision | 0.50 | `<think>` reasoning visible, contextual |
| 5 | Regional comparison | 0.70 | Comparative framework |

**Mean validation reward: 0.54** vs. the SFT calibration baseline of **0.38** – a **+42% improvement**

### Bimodal Performance Pattern

- **Strong (0.50–0.70):** Open-ended analysis, insights, and comparison tasks
- **Weak (0.12):** Structured JSON extraction – the completion ceiling blocks the model from emitting complete JSON

---

## 5. Diagnosed Issues & Root Causes

### Issue 1: Entropy Collapse (Critical)
- **Symptom:** `clip_ratio=0` on all steps, KL=0.004
- **Root cause:** Policy entropy drops to near-zero → all 8 rollouts produce identical output → zero advantage → zero gradient (Skywork-OR1, 2505.22312)
- **Fix (v3):** Temperature=1.0, add an entropy-loss coefficient (α=5e-3), and filter zero-advantage groups, as illustrated below
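
To make "filter zero-advantage groups" concrete: GRPO's advantage for each rollout is its reward minus the group mean (no std division with `scale_rewards=False`), so a group whose eight rollouts all earn the same reward contributes exactly nothing. A mechanism sketch, not the trainer's internal code:

```python
def group_advantages(rewards: list[float]) -> list[float] | None:
    """Group-relative advantages as GRPO computes them with scale_rewards=False:
    subtract the group mean, skip the std division (Dr. GRPO). Returns None for
    collapsed groups that carry no gradient signal."""
    if len(set(rewards)) == 1:   # all rollouts scored identically
        return None              # zero-advantage group: filter it out
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

print(group_advantages([0.3] * 8))             # None -> the v1/v2 failure mode
print(group_advantages([0.1, 0.5, 0.3, 0.3]))  # approx. [-0.2, 0.2, 0.0, 0.0]
```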

### Issue 2: Completion Length Ceiling (Critical)
- **Symptom:** `completion_length=2048` (= `max_completion_length`) on every step
- **Root cause:** GRPO's length bias inflates the length of incorrect responses (Dr. GRPO, 2503.20783 §3.1). The model can't finish its reasoning → low rewards → weak signal
- **Fix (v3):** Increase `max_completion_length` to 4096 and reduce `num_generations` from 8 to 4 to stay within VRAM

### Issue 3: Data Scale (Moderate)
- **Symptom:** Early stopping at step 210 (70% of the epoch), eval plateaued at 0.125
- **Root cause:** 300 prompts is below published minimums (1K–600K in the literature)
- **Fix (v3):** Expand to 1000+ prompts via synthetic generation and data augmentation

---

## 6. Lessons Learned

### Technical Lessons

1. **Default model configs kill RL training.** Qwen3's `generation_config.json` sets `temperature=0.1`. This single default destroyed the first full training run. Always override generation parameters explicitly.

2. **Reward function design is the core ML engineering task.** Binary rewards → zero signal. Continuous rewards → training works. Multi-component rewards (format + content + quality) → staged learning in which format converges first. The reward function IS the product specification.

3. **GRPO needs diversity to learn.** The algorithm is fundamentally about comparing different completions. Anything that reduces diversity (low temperature, small group size, few prompts, a completion ceiling) directly reduces the learning signal.

4. **TRL step calculation is non-obvious.** `steps = num_prompts × num_generations / (batch_size × grad_accum)`. Missing the `num_generations` multiplier gives wrong epoch estimates. `MAX_STEPS` always overrides `NUM_EPOCHS`.

5. **Early stopping needs tuning for generative models.** Patience must account for eval generation length. Short eval token budgets → incomplete outputs → flat eval scores → premature stopping.

6. **Entropy collapse is the GRPO failure mode.** Not divergence, not reward hacking: the model collapses to deterministic output. Monitoring `clip_ratio` and generation entropy is essential.

### Business Lessons

1. **Domain data is the moat, not model size.** ThinkJSON (1.5B) beats DeepSeek-R1 (671B) on JSON extraction. A 7B model beats o3-mini on SQL. 300 Portuguese e-commerce examples already produced a model that outperforms the SFT baseline by 42%.

2. **Cost arbitrage is immediate.** A self-hosted 3.7B model on a $0.50/hr GPU costs ~$0.001/analysis vs. $0.01+ per API call. Breakeven is at roughly 100 analyses/day.

3. **Privacy is a feature.** Self-hosting means customer data never leaves the organization. This matters for LGPD (Brazilian data-protection law) compliance.

4. **Portuguese-first is defensible.** Most LLM development is English-first. A model that deeply understands Brazilian e-commerce Portuguese ("veio com defeito" / "arrived defective", "nota 1 estrela" / "1-star rating") has a real competitive advantage.

5. **Budget 3-5 iterations, not 1.** The first run is diagnostic. v1 found the zero-signal bug. v2 found the temperature bug and the completion ceiling. v3 will address entropy collapse. Each iteration is cheaper than the last because you know what to measure.

---

## 7. Next Steps

See `docs/ADR-001-next-steps.md` for detailed execution plans.

### Priority 1: Build Domain Benchmark (1-2 days)
50-100 held-out prompts with automated scoring; establish baselines

### Priority 2: Run Comparison vs. Qwen3-35B-A3B (1 day)
Prove that the small tuned model matches or beats a large general model on domain tasks

### Priority 3: GRPO v3 Training Run (2-3 days)
Fix entropy collapse, increase the completion length, and expand the training data

---

## References

| Paper | Key Finding | Relevance |
|-------|-------------|-----------|
| DeepSeek-R1 (2501.12948) | SFT→GRPO pipeline, rule-based rewards | Architecture template |
| Dr. GRPO (2503.20783) | Remove std normalization, remove length bias | Fixes reward scaling |
| Skywork-OR1 MAGIC (2505.22312) | Entropy-collapse diagnosis and fix | Explains clip_ratio=0 |
| MC-GRPO (2601.22582) | Median baseline for small rollout budgets | Fixes G=8 noise |
| ThinkJSON (2502.14905) | 1.5B beats 671B on JSON extraction | Proves the domain-specialization thesis |
| Reasoning-SQL (2503.23157) | 7B beats o3-mini on SQL with GRPO | Proves GRPO works for SQL |
| Cocktail Effect (2410.01109) | Multi-task SFT + general data boosts domain performance | SFT improvement recipe |