# ADR-001: Tucano2-Commerce Next Steps β€” Detailed Execution Plans
**Status:** Accepted
**Date:** 2025-04-23
**Author:** Rafael Ferraz
**Context:** GRPO v2 training completed (210/300 steps, early stopped). Mean validation reward 0.54 (+42% vs SFT baseline). Three critical issues diagnosed: entropy collapse, completion length ceiling, data scale. This ADR details the execution plan for the next phase.
---
## Phase 1: Build the Domain Benchmark
**Goal:** Create a rigorous, reproducible evaluation suite that measures Tucano2 across all 5 task types.
**Time estimate:** 1-2 days
**Prerequisite:** None β€” can start immediately
### Step 1.1: Design the Benchmark Prompts
Create **80 held-out prompts** (never seen in training), stratified by task:
| Task | Count | Source | Difficulty Mix |
|------|-------|--------|---------------|
| Structured Extraction | 20 | Real reviews from Olist/B2W datasets | 10 easy (clear sentiment) + 10 hard (mixed/ambiguous) |
| Sentiment Analysis | 15 | Real reviews, balanced pos/neg/neutral | 5 per polarity |
| SQL Generation | 15 | Business questions against your e-commerce schema | 5 simple SELECT + 5 JOIN + 5 aggregate/window |
| Churn/Risk Prediction | 15 | Customer profiles with known outcomes | 5 low-risk + 5 medium + 5 high-risk |
| Business Insights | 15 | Open-ended analytical questions | 5 comparison + 5 trend + 5 recommendation |
**Implementation:**
```python
# File: benchmark/prompts.jsonl
# Each line is one JSON object (pretty-printed here for readability):
{
  "id": "ext-001",
  "task": "extraction",
  "difficulty": "hard",
  "prompt": "Analise esta avaliação...",
  "system": "<your system prompt>",
  "ground_truth": {
    "sentiment": "negativo",
    "sentiment_score": -0.6,
    "churn_risk": 0.8,
    ...
  },
  "notes": "Mixed sentiment — product good but delivery terrible"
}
```
**Key principles:**
- Include edge cases: mixed sentiment, sarcasm, regional slang ("barato saiu caro", i.e. "the cheap option turned out expensive"), incomplete reviews
- SQL prompts must have executable ground truth queries against your actual schema
- Insights prompts should require multi-step reasoning, not one-hop lookups
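Before wiring up scorers, a quick sanity pass over `prompts.jsonl` catches schema drift early. A minimal sketch, assuming the field names from the example above:

```python
import json

# ground_truth is not required here: insights prompts are LLM-judged
# and may omit it
REQUIRED = {"id", "task", "difficulty", "prompt", "system"}
TASKS = {"extraction", "sentiment", "sql", "churn", "insights"}

with open("benchmark/prompts.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        rec = json.loads(line)
        missing = REQUIRED - rec.keys()
        assert not missing, f"line {i}: missing keys {missing}"
        assert rec["task"] in TASKS, f"line {i}: unknown task {rec['task']!r}"
```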
### Step 1.2: Build Automated Scoring Functions
Each task type gets its own scorer:
#### Extraction Scorer
```python
def score_extraction(pred: dict, gt: dict) -> dict:
    """Score each of the 10 JSON fields independently."""
    scores = {}
    # Categorical fields: exact match
    for field in ["sentiment", "complaint_category"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0
    # Numeric fields: distance-based
    for field in ["sentiment_score", "churn_risk", "repeat_intent"]:
        scores[field] = max(0.0, 1.0 - abs(pred[field] - gt[field]))
    # Boolean fields: exact match
    for field in ["delivery_issue", "product_issue", "seller_issue", "would_recommend"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0
    # Text fields: semantic similarity (sentence-transformers; helpers below)
    for field in ["main_complaint"]:
        scores[field] = cosine_similarity(embed(pred[field]), embed(gt[field]))
    scores["mean"] = sum(scores.values()) / len(scores)
    return scores
```
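The `embed` / `cosine_similarity` helpers above are assumed. A minimal sketch with sentence-transformers; the checkpoint name is an illustrative choice, any multilingual model that handles PT-BR works:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed helpers; the checkpoint below is an example, not a requirement
_embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def embed(text: str):
    return _embedder.encode(text, convert_to_tensor=True)

def cosine_similarity(a, b) -> float:
    # util.cos_sim returns a 1x1 tensor for a pair of single vectors
    return float(util.cos_sim(a, b).item())
```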
#### SQL Scorer
```python
def score_sql(predicted_sql: str, ground_truth_sql: str, db_connection) -> dict:
    """Execute both queries and compare result sets."""
    try:
        pred_results = db_connection.execute(predicted_sql).fetchall()
        gt_results = db_connection.execute(ground_truth_sql).fetchall()
        # Execution accuracy (EX): do the result sets match?
        ex = 1.0 if set(pred_results) == set(gt_results) else 0.0
        # Partial credit: row overlap
        overlap = len(set(pred_results) & set(gt_results))
        partial = overlap / max(len(gt_results), 1)
        return {"ex": ex, "partial": partial, "syntax_valid": 1.0}
    except Exception:
        return {"ex": 0.0, "partial": 0.0, "syntax_valid": 0.0}
```
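A usage sketch against an in-memory SQLite copy of the schema; the table and queries here are illustrative placeholders, not the actual e-commerce schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

print(score_sql(
    "SELECT id FROM orders WHERE total > 5",    # predicted
    "SELECT id FROM orders WHERE total > 5.0",  # ground truth
    conn,
))  # -> {"ex": 1.0, "partial": 1.0, "syntax_valid": 1.0}
```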
#### Sentiment Scorer
```python
def score_sentiment(pred: dict, gt: dict) -> dict:
    """Exact match on polarity, distance on score (scores live in [-1, 1])."""
    polarity_match = 1.0 if pred["polarity"] == gt["polarity"] else 0.0
    score_distance = max(0.0, 1.0 - abs(pred["score"] - gt["score"]) / 2.0)
    return {"polarity": polarity_match, "score": score_distance}
```
#### Churn Scorer
```python
def score_churn(prediction: float, ground_truth: float, threshold=0.5) -> dict:
    """Binary accuracy + calibration."""
    binary = 1.0 if (prediction >= threshold) == (ground_truth >= threshold) else 0.0
    calibration = max(0.0, 1.0 - abs(prediction - ground_truth))
    return {"binary_accuracy": binary, "calibration": calibration}
```
#### Insights Scorer (LLM-as-Judge)
```python
import json
import openai

JUDGE_PROMPT = """You are evaluating a Brazilian e-commerce analysis response.
Rate on 4 dimensions (1-5 each):
1. Relevance: Does it address the question?
2. Accuracy: Are the claims factually reasonable?
3. Actionability: Could a business act on this analysis?
4. Portuguese Quality: Is the PT-BR natural and professional?

Response to evaluate:
{response}

Question asked:
{question}

Return JSON: {{"relevance": X, "accuracy": X, "actionability": X, "portuguese": X}}"""
# Note the doubled braces: the literal JSON example must be escaped
# so str.format only fills {response} and {question}.

def score_insights(response: str, question: str) -> dict:
    """Use GPT-4o as judge, run 3x and average."""
    scores = []
    for _ in range(3):
        result = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                response=response, question=question
            )}],
            temperature=0.3,
            response_format={"type": "json_object"},  # force parseable JSON
        )
        scores.append(json.loads(result.choices[0].message.content))
    # Average each dimension across the 3 runs
    return {k: sum(s[k] for s in scores) / 3 for k in scores[0]}
```
### Step 1.3: Run the Benchmark Script
```python
# File: benchmark/run_benchmark.py
import json

SCORERS = {
    "extraction": score_extraction,
    "sentiment": score_sentiment,
    "sql": score_sql,
    "churn": score_churn,
    "insights": score_insights,
}

def mean_of(scores: dict) -> float:
    """Not every scorer returns a 'mean' key, so fall back to averaging."""
    return scores.get("mean", sum(scores.values()) / len(scores))

def run_benchmark(model, tokenizer, prompts_path, output_path):
    with open(prompts_path, encoding="utf-8") as f:
        prompts = [json.loads(line) for line in f]
    results = []
    for prompt in prompts:
        # Generate response (a sketch of generate() follows this block)
        messages = [
            {"role": "system", "content": prompt["system"]},
            {"role": "user", "content": prompt["prompt"]},
        ]
        response = generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1)
        # Score based on task type. parse_response() extracts the JSON/SQL
        # payload from the raw completion. Note: score_sql and score_insights
        # take extra arguments (db connection, question), so each needs a thin
        # adapter to fit this uniform call.
        scorer = SCORERS[prompt["task"]]
        score = scorer(parse_response(response), prompt.get("ground_truth"))
        results.append({
            "id": prompt["id"],
            "task": prompt["task"],
            "difficulty": prompt["difficulty"],
            "response": response,
            "scores": score,
            "tokens": len(tokenizer.encode(response)),
        })
    # Save results
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    # Print per-task summary
    for task in ["extraction", "sentiment", "sql", "churn", "insights"]:
        task_results = [r for r in results if r["task"] == task]
        task_mean = sum(mean_of(r["scores"]) for r in task_results) / len(task_results)
        print(f"{task}: {task_mean:.3f} (n={len(task_results)})")
```
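The `generate()` helper used above is assumed. A minimal sketch via the standard HF chat-template path; adapt for Unsloth's fast-inference mode as needed:

```python
import torch

def generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1):
    # Apply the model's chat template and move inputs to the model's device
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(
            inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
        )
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
```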
### Step 1.4: Establish Baselines
Run the benchmark on:
1. **Qwen3-3.7B base** (zero-shot, no adapters) β€” this is your floor
2. **Tucano2-SFT** (SFT adapter only) β€” this is your pre-GRPO baseline
3. **Tucano2-GRPO v2** (current best) β€” this is what you're improving
Record all results in a comparison table.
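A sketch of the baseline sweep; `load_model()` is an assumed wrapper that returns `(model, tokenizer)` with the given adapter applied (`None` = base model), and the checkpoint paths are placeholders:

```python
BASELINES = {
    "qwen3-3.7b-base": None,                       # floor: no adapter
    "tucano2-sft": "checkpoints/sft-best",         # hypothetical path
    "tucano2-grpo-v2": "checkpoints/grpo-v2-best", # hypothetical path
}

for name, adapter_path in BASELINES.items():
    model, tokenizer = load_model(adapter_path)
    run_benchmark(model, tokenizer, "benchmark/prompts.jsonl", f"results/{name}.json")
```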
---
## Phase 2: Run Comparison vs. Qwen3-35B-A3B
**Goal:** Prove domain-tuned 3.7B matches or beats general 35B on e-commerce tasks.
**Time estimate:** 1 day (after benchmark is built)
**Prerequisite:** Phase 1 complete
### Step 2.1: Set Up Qwen3-35B-A3B
This is a Mixture-of-Experts model (35B total, ~3B active per token). It should fit on your L4 with 4-bit quantization.
```python
from unsloth import FastLanguageModel

model_35b, tokenizer_35b = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-35B-A3B",  # Verify exact HF repo name
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
FastLanguageModel.for_inference(model_35b)
```
**Memory estimate:** 35B params × 0.5 bytes (4-bit) ≈ 17.5GB of weights. Only ~3B params are active per token, so per-token compute is close to a 3B dense model. Should fit on the L4 (24GB) with some headroom.
**If it doesn't fit:** Use `transformers` with `BitsAndBytesConfig(load_in_4bit=True)` instead of Unsloth, or try GGUF via `llama.cpp`.
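A sketch of the `transformers` fallback, using the same (unverified) repo name as above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_35b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-35B-A3B",  # verify exact HF repo name, as above
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer_35b = AutoTokenizer.from_pretrained("Qwen/Qwen3-35B-A3B")
```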
### Step 2.2: Run the Same Benchmark
```python
# Run identical benchmark on Qwen3-35B-A3B
run_benchmark(model_35b, tokenizer_35b, "benchmark/prompts.jsonl", "results/qwen3-35b.json")
```
**Critical:** Use the same system prompt, same temperature (0.1 for eval), same max_new_tokens. The only variable should be the model.
### Step 2.3: Optional β€” Add GPT-4o Baseline
```python
# For the strongest reference point
import json
import openai

def run_benchmark_api(prompts_path, output_path, model="gpt-4o"):
    with open(prompts_path, encoding="utf-8") as f:
        prompts = [json.loads(line) for line in f]
    results = []
    for prompt in prompts:
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt["system"]},
                {"role": "user", "content": prompt["prompt"]},
            ],
            temperature=0.1,
        )
        # ... score and collect exactly as in run_benchmark above
```
### Step 2.4: Build the Comparison Report
| Task | Qwen3-3.7B (base) | Tucano2-SFT | Tucano2-GRPO v2 | Qwen3-35B (zero-shot) | GPT-4o (zero-shot) |
|------|-------------------|-------------|-----------------|-----------------------|--------------------|
| Extraction | | | | | |
| Sentiment | | | | | |
| SQL | | | | | |
| Churn | | | | | |
| Insights | | | | | |
| **MEAN** | | | | | |
| Cost/query | $0.001 | $0.001 | $0.001 | $0.003 | $0.010 |
| Latency (s) | | | | | |
| Privacy | ✅ On-prem | ✅ On-prem | ✅ On-prem | ✅ On-prem | ❌ API |
**The story to tell:**
- If Tucano2-GRPO β‰₯ Qwen3-35B on extraction/sentiment/SQL: "Domain tuning eliminates the need for 10Γ— larger models"
- If Tucano2-GRPO < Qwen3-35B on insights: "Open-ended reasoning still benefits from scale, but structured tasks don't"
- Cost column proves the business case regardless of performance parity
---
## Phase 3: GRPO v3 Training Run
**Goal:** Fix entropy collapse, break through the v2 performance plateau.
**Time estimate:** 2-3 days (including data prep)
**Prerequisite:** Phase 1 (benchmark needed to measure improvement)
### Step 3.1: Expand Training Data (1000+ prompts)
**Target: 1000 prompts** (up from 300), stratified:
| Task | Current | Target | How to Expand |
|------|---------|--------|--------------|
| Extraction | ~75 | 300 | Generate from Olist dataset β€” sample reviews, create ground truth JSON |
| Sentiment | ~60 | 200 | Sample from B2W reviews corpus, label with existing model + human review |
| SQL | ~75 | 250 | Template-based: vary table names, WHERE clauses, aggregation patterns |
| Churn | ~45 | 100 | Augment customer profiles with synthetic variations |
| Insights | ~45 | 150 | LLM-generated analytical questions about e-commerce scenarios |
**Synthetic generation recipe (from Cocktail Effect paper):**
```python
# Use your SFT model or GPT-4o to generate new training prompts,
# then manually verify/correct the ground truth.
def generate_extraction_prompts(reviews_df, n=200):
    prompts = []
    for _, row in reviews_df.sample(n).iterrows():
        prompt = format_extraction_prompt(row["review_text"], row["score"], row["status"])
        # Generate ground truth with GPT-4o (higher quality than self-labeling);
        # gpt4o_extract() wraps the extraction call + JSON parsing
        ground_truth = gpt4o_extract(prompt)
        prompts.append({"prompt": prompt, "ground_truth": ground_truth})
    return prompts
```
**Also add 30% general reasoning data** (Cocktail Effect paper finding):
- Source: Portuguese subset of Orca-Math or translated OpenOrca
- Purpose: Regularization β€” prevents model from overfitting to domain patterns
- Mix ratio: 700 domain + 300 general = 1000 total (see the mixing sketch below)
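A minimal sketch of the 70/30 mix, assuming `domain_prompts` and `general_prompts` are the record lists produced by the generation recipes above; the output path is a placeholder:

```python
import json
import random

random.seed(42)  # reproducible split
mixed = random.sample(domain_prompts, 700) + random.sample(general_prompts, 300)
random.shuffle(mixed)  # interleave so every batch sees both distributions

with open("data/grpo_v3_train.jsonl", "w", encoding="utf-8") as f:
    for rec in mixed:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```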
### Step 3.2: Configuration Changes for v3
```python
# === GRPO v3 Config Changes ===
# FIX 1: Temperature β€” prevent entropy collapse
TEMPERATURE = 1.0 # Was 0.8 in v2. All published GRPO papers use 1.0.
# Reference: Skywork-OR1 (2505.22312) ablation shows Ο„=1.0 >> Ο„=0.6
# FIX 2: Completion length β€” remove the ceiling
MAX_COMPLETION_LENGTH = 4096 # Was 2048. Every v2 completion hit the ceiling.
# Trade-off: halve num_generations to fit VRAM
# FIX 3: Reduce generations to fit longer completions
NUM_GENERATIONS = 4 # Was 8. 4 Γ— 4096 β‰ˆ 8 Γ— 2048 in VRAM terms.
# MC-GRPO paper shows G=4 can work if using median baseline
# FIX 4: Explicit Ξ²=0 (no KL penalty)
# Dr. GRPO paper: Ξ²=0 is optimal for rule-based rewards
BETA = 0.0
# FIX 5: Learning rate β€” slightly more aggressive
LEARNING_RATE = 3e-6 # Was 2e-6. Clip ratios were all 0 β†’ room to push harder.
# Still well within published range (1e-6 to 5e-6)
# FIX 6: Max steps for expanded data
# 1000 prompts Γ— 4 generations / (4 batch Γ— 2 accum) = 500 steps/epoch
MAX_STEPS = 500
# FIX 7: Early stopping β€” more generous
EARLY_STOPPING_PATIENCE = 15 # 15 evals Γ— 10 steps = 150 steps of runway
EVAL_STEPS = 10
SAVE_STEPS = 10
```
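For orientation, here is how these constants might map onto TRL's `GRPOConfig`. A sketch: field names are from recent TRL releases, so verify them against your installed 0.24.0:

```python
from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="checkpoints/grpo-v3",  # hypothetical path
    temperature=TEMPERATURE,
    max_completion_length=MAX_COMPLETION_LENGTH,
    num_generations=NUM_GENERATIONS,
    beta=BETA,
    learning_rate=LEARNING_RATE,
    max_steps=MAX_STEPS,
    eval_strategy="steps",
    eval_steps=EVAL_STEPS,
    save_steps=SAVE_STEPS,
    scale_rewards=False,  # Dr. GRPO recommendation (see appendix)
)
```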
### Step 3.3: Add Entropy Monitoring (Critical for v3)
Since TRL 0.24.0 doesn't expose a native entropy bonus, add entropy monitoring via a callback:
```python
import wandb
from transformers import TrainerCallback

class EntropyMonitorCallback(TrainerCallback):
    """Monitor policy entropy to detect collapse early."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "train/completion_length" in logs:
            # Proxy for entropy: if all completions hit max length,
            # entropy is collapsing (model is stuck in one mode)
            completion_ratio = logs["train/completion_length"] / MAX_COMPLETION_LENGTH
            if completion_ratio > 0.95:
                print(f"⚠️ Step {state.global_step}: Completion ratio {completion_ratio:.2f} — "
                      f"possible entropy collapse. Monitor reward_std.")
            # Log to W&B
            if wandb.run:
                wandb.log({
                    "monitor/completion_ratio": completion_ratio,
                    "monitor/entropy_proxy": 1.0 - completion_ratio,
                }, step=state.global_step)
```
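Attach it through the standard `transformers` callback hook when building the trainer:

```python
trainer.add_callback(EntropyMonitorCallback())
```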
### Step 3.4: Add Zero-Advantage Group Filtering
```python
import random

# In the reward function wrapper or custom trainer:
def filtered_commerce_reward_fn(completions, prompts, **kwargs):
    """Compute rewards and flag zero-variance groups for filtering."""
    rewards = commerce_reward_fn(completions, prompts, **kwargs)
    # Group rewards by prompt (each prompt has NUM_GENERATIONS completions)
    for i in range(0, len(rewards), NUM_GENERATIONS):
        group = rewards[i:i + NUM_GENERATIONS]
        if max(group) - min(group) < 0.01:  # Near-zero variance
            # Add small noise to break the tie;
            # this prevents the GRPO advantage denominator from exploding
            for j in range(i, i + NUM_GENERATIONS):
                rewards[j] += random.gauss(0, 0.005)
    return rewards
```
**Note:** The proper fix is MC-GRPO's median baseline, but that requires modifying TRL internals. The noise injection is a pragmatic workaround for TRL 0.24.0.
### Step 3.5: Reward Function Refinement
Split the composite reward into staged components (Reasoning-SQL paper finding):
```python
def commerce_reward_fn_v3(completions, prompts, **kwargs):
    """Multi-component reward with staged convergence."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Stage 1: Format reward (converges first)
        r_format = score_format(completion)  # JSON valid? Think tags closed? Right structure?
        # Stage 2: Partial content reward
        r_partial = score_partial_content(completion, prompt)  # Some fields correct?
        # Stage 3: Full task reward
        r_task = score_full_task(completion, prompt)  # All fields correct? SQL executes?
        # Weighted combination. Fixed weights here; a decaying format
        # weight is sketched after this block.
        reward = 0.2 * r_format + 0.3 * r_partial + 0.5 * r_task
        rewards.append(reward)
    return rewards
```
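The fixed weights above don't yet realize the "format weight decreases over training" idea. One hedged way to schedule it, assuming `step` is tracked from trainer state (e.g. updated in a callback):

```python
def staged_weights(step: int, total_steps: int = 500) -> tuple[float, float, float]:
    """Linearly decay the format weight; the task weight absorbs the difference."""
    frac = min(step / total_steps, 1.0)
    w_format = 0.3 * (1.0 - frac) + 0.1 * frac  # 0.3 -> 0.1
    w_partial = 0.3                             # held constant
    w_task = 1.0 - w_format - w_partial         # 0.4 -> 0.6
    return w_format, w_partial, w_task
```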
### Step 3.6: VRAM Budget for v3
```
L4 GPU: 24GB total
Model (NF4): ~3.5GB
KV Cache (4096 seq): ~2.0GB
Activations: ~4.0GB
Optimizer states: ~3.0GB
Generations (4Γ—4096): ~8.0GB
─────────────────────────────
Estimated total: ~20.5GB
Headroom: ~3.5GB
βœ… Should fit. If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first.
```
### Step 3.7: Training Execution
```bash
# Pre-flight checklist:
# βœ… Benchmark built and baselines recorded (Phase 1)
# βœ… 1000+ prompts prepared and validated
# βœ… Config changes applied (temperature, completion length, generations, LR)
# βœ… Entropy monitor callback added
# βœ… Zero-advantage filtering active
# βœ… Reward function v3 with staged components
# βœ… FRESH=True (nuke old checkpoints)
# βœ… W&B run name: grpo-v3-l4-{timestamp}
# Expected runtime: 500 steps Γ— ~5 min/step (longer completions) β‰ˆ 42 hours
# Checkpoints every 10 steps
# Early stopping patience: 15 evals (150 steps)
```
### Step 3.8: Post-Training Validation
1. Run Phase 1 benchmark on GRPO v3 best checkpoint
2. Compare against all baselines:
- Qwen3-3.7B base β†’ Tucano2-SFT β†’ Tucano2-GRPO-v2 β†’ **Tucano2-GRPO-v3**
3. If v3 > v2 on benchmark: save as production model
4. If v3 β‰ˆ v2: entropy collapse not fully fixed; consider switching to MC-GRPO or upgrading GPU for longer completions
5. If v3 < v2: rollback, investigate β€” likely reward function regression
---
## Decision Criteria for Stopping
| Condition | Action |
|-----------|--------|
| v3 eval reward > 0.20 AND extraction score > 0.40 | Ship it β€” significantly better than v2 |
| v3 eval reward 0.15-0.20, improving trend | Run epoch 2 (extend MAX_STEPS to 1000) |
| v3 eval reward < v2 (0.125) | Stop, diagnose, review reward function and data |
| Entropy collapse again (clip_ratio=0 after step 50) | Add entropy bonus via custom loss (requires TRL fork) |
| OOM | Reduce MAX_COMPLETION_LENGTH to 3072 β†’ 2560 β†’ 2048 |
---
## Appendix: Literature References for Each Fix
| Fix | Paper | Section | Key Finding |
|-----|-------|---------|-------------|
| Temperature=1.0 | Skywork-OR1 (2505.22312) | Β§4, Table 3 | Ο„=1.0 gives 5-8% better performance, delays entropy collapse |
| Ξ²=0 (no KL) | Dr. GRPO (2503.20783) | Β§3.2 | KL penalty unnecessary for rule-based rewards |
| scale_rewards=False | Dr. GRPO (2503.20783) | Β§3.1 | Std normalization biases toward low-variance groups |
| Longer completions | Dr. GRPO (2503.20783) | Β§3.1 | GRPO length bias inflates wrong answers β†’ ceiling hit |
| Zero-advantage filtering | Skywork-OR1 (2505.22312) | Β§3.1 | Zero-std groups destabilize training |
| Staged rewards | Reasoning-SQL (2503.23157) | Β§3.2 | Format rewards converge first, enable task learning |
| General data mixing | Cocktail Effect (2410.01109) | Β§4 | 30% general data improves domain performance 2-15% |
| G=4 with median | MC-GRPO (2601.22582) | Β§3 | Median baseline reduces noise at small group sizes |