ADR-001: Tucano2-Commerce Next Steps - Detailed Execution Plans
Status: Accepted
Date: 2025-04-23
Author: Rafael Ferraz
Context: GRPO v2 training completed (210/300 steps, early stopped). Mean validation reward 0.54 (+42% vs. the SFT baseline). Three critical issues diagnosed: entropy collapse, a completion-length ceiling, and insufficient data scale. This ADR details the execution plan for the next phase.
Phase 1: Build the Domain Benchmark
Goal: Create a rigorous, reproducible evaluation suite that measures Tucano2 across all 5 task types.
Time estimate: 1-2 days
Prerequisite: None; can start immediately
Step 1.1: Design the Benchmark Prompts
Create 80 held-out prompts (never seen in training), stratified by task:
| Task | Count | Source | Difficulty Mix |
|---|---|---|---|
| Structured Extraction | 20 | Real reviews from Olist/B2W datasets | 10 easy (clear sentiment) + 10 hard (mixed/ambiguous) |
| Sentiment Analysis | 15 | Real reviews, balanced pos/neg/neutral | 5 per polarity |
| SQL Generation | 15 | Business questions against your e-commerce schema | 5 simple SELECT + 5 JOIN + 5 aggregate/window |
| Churn/Risk Prediction | 15 | Customer profiles with known outcomes | 5 low-risk + 5 medium + 5 high-risk |
| Business Insights | 15 | Open-ended analytical questions | 5 comparison + 5 trend + 5 recommendation |
Implementation:
# File: benchmark/prompts.jsonl
# Each line is a JSON object:
{
  "id": "ext-001",
  "task": "extraction",
  "difficulty": "hard",
  "prompt": "Analise esta avaliação...",
  "system": "<your system prompt>",
  "ground_truth": {
    "sentiment": "negativo",
    "sentiment_score": -0.6,
    "churn_risk": 0.8,
    ...
  },
  "notes": "Mixed sentiment: product good but delivery terrible"
}
Key principles:
- Include edge cases: mixed sentiment, sarcasm, regional slang ("barato saiu caro", roughly "cheap turned out expensive"), incomplete reviews
- SQL prompts must have executable ground truth queries against your actual schema (see the example entry below)
- Insights prompts should require multi-step reasoning, not one-hop lookups
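For instance, a SQL benchmark entry could look like the following sketch; the schema, table, and column names are illustrative placeholders, not the actual e-commerce schema:

{
  "id": "sql-003",
  "task": "sql",
  "difficulty": "medium",
  "prompt": "Qual é o ticket médio por categoria de produto nos últimos 90 dias?",
  "system": "<your system prompt>",
  "ground_truth": {
    "sql": "SELECT p.category, AVG(o.total) AS avg_ticket FROM orders o JOIN products p ON p.id = o.product_id WHERE o.created_at >= CURRENT_DATE - INTERVAL '90 days' GROUP BY p.category"
  },
  "notes": "JOIN + aggregate; ground truth must execute against the real schema"
}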
Step 1.2: Build Automated Scoring Functions
Each task type gets its own scorer:
Extraction Scorer
def score_extraction(prediction: dict, ground_truth: dict) -> dict:
    """Score each of the 10 JSON fields independently."""
    pred, gt = prediction, ground_truth
    scores = {}
    # Categorical fields: exact match
    for field in ["sentiment", "complaint_category"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0
    # Numeric fields: distance-based
    for field in ["sentiment_score", "churn_risk", "repeat_intent"]:
        scores[field] = max(0.0, 1.0 - abs(pred[field] - gt[field]))
    # Boolean fields: exact match
    for field in ["delivery_issue", "product_issue", "seller_issue", "would_recommend"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0
    # Text fields: semantic similarity (embed/cosine_similarity are
    # sentence-transformers-based helpers defined elsewhere)
    for field in ["main_complaint"]:
        scores[field] = cosine_similarity(embed(pred[field]), embed(gt[field]))
    scores["mean"] = sum(scores.values()) / len(scores)
    return scores
SQL Scorer
def score_sql(predicted_sql: str, ground_truth_sql: str, db_connection) -> dict:
    """Execute both queries and compare result sets."""
    try:
        pred_results = db_connection.execute(predicted_sql).fetchall()
        gt_results = db_connection.execute(ground_truth_sql).fetchall()
        # Execution accuracy (EX): do the result sets match?
        ex = 1.0 if set(pred_results) == set(gt_results) else 0.0
        # Partial credit: row overlap
        overlap = len(set(pred_results) & set(gt_results))
        partial = overlap / max(len(gt_results), 1)
        return {"ex": ex, "partial": partial, "syntax_valid": 1.0}
    except Exception:
        return {"ex": 0.0, "partial": 0.0, "syntax_valid": 0.0}
Sentiment Scorer
def score_sentiment(prediction: dict, ground_truth: dict) -> dict:
    """Exact match on polarity, distance on score (scores range from -1 to 1)."""
    polarity_match = 1.0 if prediction["polarity"] == ground_truth["polarity"] else 0.0
    score_distance = max(0.0, 1.0 - abs(prediction["score"] - ground_truth["score"]) / 2.0)
    return {"polarity": polarity_match, "score": score_distance}
Churn Scorer
def score_churn(prediction: float, ground_truth: float, threshold=0.5) -> dict:
    """Binary accuracy + calibration."""
    binary = 1.0 if (prediction >= threshold) == (ground_truth >= threshold) else 0.0
    calibration = max(0.0, 1.0 - abs(prediction - ground_truth))
    return {"binary_accuracy": binary, "calibration": calibration}
Insights Scorer (LLM-as-Judge)
JUDGE_PROMPT = """You are evaluating a Brazilian e-commerce analysis response.
Rate on 4 dimensions (1-5 each):
1. Relevance: Does it address the question?
2. Accuracy: Are the claims factually reasonable?
3. Actionability: Could a business act on this analysis?
4. Portuguese Quality: Is the PT-BR natural and professional?
Response to evaluate:
{response}
Question asked:
{question}
Return JSON: {{"relevance": X, "accuracy": X, "actionability": X, "portuguese": X}}"""
# Note: the literal JSON braces are doubled so .format() doesn't treat them as placeholders.
import json
import openai

def score_insights(response: str, question: str) -> dict:
    """Use GPT-4o as judge, run 3x and average to reduce judge variance."""
    scores = []
    for _ in range(3):
        result = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                response=response, question=question
            )}],
            temperature=0.3,
        )
        # Assumes the judge returns bare JSON; consider response_format={"type": "json_object"}
        scores.append(json.loads(result.choices[0].message.content))
    # Average each dimension across the 3 runs
    return {k: sum(s[k] for s in scores) / 3 for k in scores[0]}
Step 1.3: Run the Benchmark Script
# File: benchmark/run_benchmark.py
import json

def run_benchmark(model, tokenizer, prompts_path, output_path):
    with open(prompts_path) as f:
        prompts = [json.loads(line) for line in f]
    results = []
    for prompt in prompts:
        # Generate response
        messages = [
            {"role": "system", "content": prompt["system"]},
            {"role": "user", "content": prompt["prompt"]}
        ]
        response = generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1)
        # Score based on task type. Note: scorer signatures differ (score_sql
        # needs a db_connection, score_insights takes the raw response and the
        # question), so dispatch per task in the real script.
        scorer = SCORERS[prompt["task"]]
        score = scorer(parse_response(response), prompt.get("ground_truth"))
        # Not every scorer emits "mean"; derive it so the summary below works
        score.setdefault("mean", sum(score.values()) / len(score))
        results.append({
            "id": prompt["id"],
            "task": prompt["task"],
            "difficulty": prompt["difficulty"],
            "response": response,
            "scores": score,
            "tokens": len(tokenizer.encode(response)),
        })
    # Save results
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    # Print summary
    for task in ["extraction", "sentiment", "sql", "churn", "insights"]:
        task_results = [r for r in results if r["task"] == task]
        mean_score = sum(r["scores"]["mean"] for r in task_results) / len(task_results)
        print(f"{task}: {mean_score:.3f} (n={len(task_results)})")
SCORERS = {
    "extraction": score_extraction,
    "sentiment": score_sentiment,
    "sql": score_sql,
    "churn": score_churn,
    "insights": score_insights,
}
Step 1.4: Establish Baselines
Run the benchmark on:
- Qwen3-3.7B base (zero-shot, no adapters): this is your floor
- Tucano2-SFT (SFT adapter only): this is your pre-GRPO baseline
- Tucano2-GRPO v2 (current best): this is what you're improving
Record all results in a comparison table.
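A minimal sweep over the three variants, assuming a hypothetical load_variant helper that returns the base model with the given adapter attached (adapter paths are illustrative):

VARIANTS = {
    "qwen3-base": None,                     # floor: no adapter
    "tucano2-sft": "adapters/sft",          # pre-GRPO baseline
    "tucano2-grpo-v2": "adapters/grpo-v2",  # current best
}

for name, adapter_path in VARIANTS.items():
    model, tokenizer = load_variant(adapter_path)  # hypothetical loader
    run_benchmark(model, tokenizer, "benchmark/prompts.jsonl", f"results/{name}.json")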
Phase 2: Run Comparison vs. Qwen3-35B-A3B
Goal: Prove domain-tuned 3.7B matches or beats general 35B on e-commerce tasks.
Time estimate: 1 day (after benchmark is built)
Prerequisite: Phase 1 complete
Step 2.1: Set Up Qwen3-35B-A3B
This is a Mixture-of-Experts model (35B total, ~3B active per token). It should fit on your L4 with 4-bit quantization.
from unsloth import FastLanguageModel

model_35b, tokenizer_35b = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-35B-A3B",  # Verify exact HF repo name
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
FastLanguageModel.for_inference(model_35b)
Memory estimate: 35B params × 0.5 bytes (4-bit) ≈ 17.5GB. Only ~3B params are active per token, so per-token compute is roughly that of a 3B model (~3GB). Should fit on the L4 (24GB) with headroom.
If it doesn't fit: Use transformers with BitsAndBytesConfig(load_in_4bit=True) instead of Unsloth, or try GGUF via llama.cpp.
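A sketch of that fallback using the standard transformers 4-bit loading path (the repo name still needs verifying, as noted above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 matches the memory estimate above
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_35b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-35B-A3B",                # Verify exact HF repo name
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer_35b = AutoTokenizer.from_pretrained("Qwen/Qwen3-35B-A3B")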
Step 2.2: Run the Same Benchmark
# Run identical benchmark on Qwen3-35B-A3B
run_benchmark(model_35b, tokenizer_35b, "benchmark/prompts.jsonl", "results/qwen3-35b.json")
Critical: Use the same system prompt, same temperature (0.1 for eval), same max_new_tokens. The only variable should be the model.
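One way to keep that guarantee honest is to hoist the generation settings into a single shared constant that every benchmark call imports (a small convention, not something the script above already does):

# benchmark/eval_config.py - single source of truth for eval-time settings
EVAL_GEN_KWARGS = {"max_new_tokens": 2048, "temperature": 0.1}

# Every call site then reads the same values:
# response = generate(model, tokenizer, messages, **EVAL_GEN_KWARGS)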
Step 2.3 (Optional): Add GPT-4o Baseline
# For the strongest reference point
import json
import openai

def run_benchmark_api(prompts_path, output_path, model="gpt-4o"):
    with open(prompts_path) as f:
        prompts = [json.loads(line) for line in f]
    results = []
    for prompt in prompts:
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt["system"]},
                {"role": "user", "content": prompt["prompt"]}
            ],
            temperature=0.1
        )
        # ... score same as above
Step 2.4: Build the Comparison Report
| Task | Qwen3-3.7B (base) | Tucano2-SFT | Tucano2-GRPO v2 | Qwen3-35B (zero-shot) | GPT-4o (zero-shot) |
|---|---|---|---|---|---|
| Extraction | | | | | |
| Sentiment | | | | | |
| SQL | | | | | |
| Churn | | | | | |
| Insights | | | | | |
| MEAN | | | | | |
| Cost/query | $0.001 | $0.001 | $0.001 | $0.003 | $0.010 |
| Latency (s) | | | | | |
| Privacy | ✅ On-prem | ✅ On-prem | ✅ On-prem | ✅ On-prem | ❌ API |
The story to tell:
- If Tucano2-GRPO ≥ Qwen3-35B on extraction/sentiment/SQL: "Domain tuning eliminates the need for 10× larger models"
- If Tucano2-GRPO < Qwen3-35B on insights: "Open-ended reasoning still benefits from scale, but structured tasks don't"
- The cost column proves the business case regardless of performance parity
Phase 3: GRPO v3 Training Run
Goal: Fix entropy collapse, break through the v2 performance plateau.
Time estimate: 2-3 days (including data prep)
Prerequisite: Phase 1 (benchmark needed to measure improvement)
Step 3.1: Expand Training Data (1000+ prompts)
Target: 1000 prompts (up from 300), stratified:
| Task | Current | Target | How to Expand |
|---|---|---|---|
| Extraction | ~75 | 300 | Generate from the Olist dataset: sample reviews, create ground truth JSON |
| Sentiment | ~60 | 200 | Sample from the B2W reviews corpus, label with existing model + human review |
| SQL | ~75 | 250 | Template-based: vary table names, WHERE clauses, aggregation patterns (see the template sketch after the recipe below) |
| Churn | ~45 | 100 | Augment customer profiles with synthetic variations |
| Insights | ~45 | 150 | LLM-generated analytical questions about e-commerce scenarios |
Synthetic generation recipe (from Cocktail Effect paper):
# Use your SFT model or GPT-4o to generate new training prompts
# Then manually verify/correct the ground truth
def generate_extraction_prompts(reviews_df, n=200):
    prompts = []
    for _, row in reviews_df.sample(n).iterrows():
        prompt = format_extraction_prompt(row["review_text"], row["score"], row["status"])
        # Generate ground truth with GPT-4o (higher quality than self-labeling)
        ground_truth = gpt4o_extract(prompt)
        prompts.append({"prompt": prompt, "ground_truth": ground_truth})
    return prompts
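For the SQL rows in the table above, a minimal template-based generator could look like this sketch. The tables, columns, and aggregation phrasings are illustrative, and the questions are phrased in English here for readability; the real prompts would be PT-BR:

import random

# Illustrative schema fragments; swap in the real e-commerce schema.
TABLES = {"orders": ["total"], "reviews": ["score"]}
AGGS = [("AVG", "average"), ("SUM", "total"), ("MAX", "maximum")]
WINDOWS = [30, 90, 365]

def generate_sql_prompts(n=50):
    prompts = []
    for _ in range(n):
        table, cols = random.choice(list(TABLES.items()))
        col = random.choice(cols)
        agg, phrase = random.choice(AGGS)
        days = random.choice(WINDOWS)
        prompts.append({
            "prompt": f"What is the {phrase} {col} in {table} over the last {days} days?",
            "ground_truth": {
                "sql": (
                    f"SELECT {agg}({col}) FROM {table} "
                    f"WHERE created_at >= CURRENT_DATE - INTERVAL '{days} days'"
                )
            },
        })
    return prompts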
Also add 30% general reasoning data (a Cocktail Effect paper finding); see the mixing sketch after this list:
- Source: Portuguese subset of Orca-Math or translated OpenOrca
- Purpose: Regularization; prevents the model from overfitting to domain patterns
- Mix ratio: 700 domain + 300 general = 1000 total
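A trivial mixing sketch under those ratios (file paths are placeholders):

import json
import random

def build_training_mix(domain_path, general_path, out_path, seed=42):
    with open(domain_path) as f:
        domain = [json.loads(line) for line in f][:700]
    with open(general_path) as f:
        general = [json.loads(line) for line in f][:300]
    mix = domain + general
    random.Random(seed).shuffle(mix)  # interleave so every batch sees both kinds
    with open(out_path, "w") as f:
        for item in mix:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

build_training_mix("data/domain.jsonl", "data/general.jsonl", "data/grpo_v3_mix.jsonl")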
Step 3.2: Configuration Changes for v3
# === GRPO v3 Config Changes ===

# FIX 1: Temperature - prevent entropy collapse
TEMPERATURE = 1.0  # Was 0.8 in v2. All published GRPO papers use 1.0.
# Reference: Skywork-OR1 (2505.22312) ablation shows τ=1.0 >> τ=0.6

# FIX 2: Completion length - remove the ceiling
MAX_COMPLETION_LENGTH = 4096  # Was 2048. Every v2 completion hit the ceiling.
# Trade-off: halve num_generations to fit VRAM

# FIX 3: Reduce generations to fit longer completions
NUM_GENERATIONS = 4  # Was 8. 4 × 4096 ≈ 8 × 2048 in VRAM terms.
# MC-GRPO paper shows G=4 can work if using a median baseline

# FIX 4: Explicit β=0 (no KL penalty)
# Dr. GRPO paper: β=0 is optimal for rule-based rewards
BETA = 0.0

# FIX 5: Learning rate - slightly more aggressive
LEARNING_RATE = 3e-6  # Was 2e-6. Clip ratios were all 0, so there is room to push harder.
# Still well within the published range (1e-6 to 5e-6)

# FIX 6: Max steps for expanded data
# 1000 prompts × 4 generations / (4 batch × 2 accum) = 500 steps/epoch
MAX_STEPS = 500

# FIX 7: Early stopping - more generous
EARLY_STOPPING_PATIENCE = 15  # 15 evals × 10 steps = 150 steps of runway
EVAL_STEPS = 10
SAVE_STEPS = 10
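These constants would be wired into TRL roughly as below; the GRPOConfig field names reflect recent TRL releases and should be double-checked against the installed 0.24.0:

from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="checkpoints/grpo-v3",
    temperature=TEMPERATURE,
    max_completion_length=MAX_COMPLETION_LENGTH,
    num_generations=NUM_GENERATIONS,
    beta=BETA,                  # explicit, even if 0.0 is already the default
    scale_rewards=False,        # Dr. GRPO: avoid std-normalization bias
    learning_rate=LEARNING_RATE,
    max_steps=MAX_STEPS,
    eval_steps=EVAL_STEPS,
    save_steps=SAVE_STEPS,
)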
Step 3.3: Add Entropy Monitoring (Critical for v3)
Since TRL 0.24.0 doesn't ship a native entropy bonus, implement monitoring via a callback:
import wandb
from transformers import TrainerCallback

class EntropyMonitorCallback(TrainerCallback):
    """Monitor policy entropy to detect collapse early."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "train/completion_length" in logs:
            # Proxy for entropy: if all completions are max length,
            # entropy is collapsing (model is stuck in one mode)
            completion_ratio = logs["train/completion_length"] / MAX_COMPLETION_LENGTH
            if completion_ratio > 0.95:
                print(f"⚠️ Step {state.global_step}: Completion ratio {completion_ratio:.2f}; "
                      f"possible entropy collapse. Monitor reward_std.")
            # Log to W&B
            if wandb.run:
                wandb.log({
                    "monitor/completion_ratio": completion_ratio,
                    "monitor/entropy_proxy": 1.0 - completion_ratio,
                }, step=state.global_step)
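If you can get at the policy logits (e.g. in a custom trainer hook), mean token entropy can also be measured directly rather than proxied; a minimal sketch:

import torch

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Mean per-token entropy for logits of shape (batch, seq_len, vocab_size).

    A healthy GRPO run keeps this well above zero; a steady slide toward
    zero is the collapse signature the callback above can only approximate.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    return entropy.mean().item()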
Step 3.4: Add Zero-Advantage Group Filtering
import random

# In the reward function wrapper or custom trainer:
def filtered_commerce_reward_fn(completions, prompts, **kwargs):
    """Compute rewards and flag zero-variance groups for filtering."""
    rewards = commerce_reward_fn(completions, prompts, **kwargs)
    # Group rewards by prompt (each prompt has NUM_GENERATIONS completions)
    for i in range(0, len(rewards), NUM_GENERATIONS):
        group = rewards[i:i + NUM_GENERATIONS]
        if max(group) - min(group) < 0.01:  # Near-zero variance
            # Add small noise to break the tie
            # This prevents the GRPO denominator from exploding
            for j in range(i, i + NUM_GENERATIONS):
                rewards[j] += random.gauss(0, 0.005)
    return rewards
Note: The proper fix is MC-GRPO's median baseline, but that requires modifying TRL internals. The noise injection is a pragmatic workaround for TRL 0.24.0.
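For reference, the median-baseline idea itself is simple; this sketch shows the advantage computation in isolation (the hard part, not shown, is threading it through TRL's trainer):

import statistics

def median_baseline_advantages(group_rewards: list[float]) -> list[float]:
    """MC-GRPO-style advantages: reward minus the group median.

    Unlike the standard (r - mean) / std, there is no std in the
    denominator, so zero-variance groups cannot blow up the update.
    """
    baseline = statistics.median(group_rewards)
    return [r - baseline for r in group_rewards]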
Step 3.5: Reward Function Refinement
Split the composite reward into staged components (Reasoning-SQL paper finding):
def commerce_reward_fn_v3(completions, prompts, **kwargs):
    """Multi-component reward with staged convergence."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Stage 1: Format reward (converges first)
        r_format = score_format(completion)  # JSON valid? Think tags closed? Right structure?
        # Stage 2: Partial content reward
        r_partial = score_partial_content(completion, prompt)  # Some fields correct?
        # Stage 3: Full task reward
        r_task = score_full_task(completion, prompt)  # All fields correct? SQL executes?
        # Weighted combination. Weights are fixed here; ideally the format
        # weight would decay over training once formatting has converged.
        reward = 0.2 * r_format + 0.3 * r_partial + 0.5 * r_task
        rewards.append(reward)
    return rewards
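If you do want the format weight to decay, a simple linear anneal keyed to the training step is enough; a sketch (the step would come from trainer state, and the start/end weights are illustrative):

def staged_weights(step: int, max_steps: int = 500) -> tuple[float, float, float]:
    """Linearly shift weight from format to task as training progresses.

    At step 0: (0.4, 0.3, 0.3). By max_steps: (0.1, 0.3, 0.6).
    The partial-content weight stays constant as a bridge between the two.
    """
    progress = min(step / max_steps, 1.0)
    w_format = 0.4 - 0.3 * progress
    w_task = 0.3 + 0.3 * progress
    return w_format, 0.3, w_task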
Step 3.6: VRAM Budget for v3
L4 GPU: 24GB total

Model (NF4):            ~3.5GB
KV cache (4096 seq):    ~2.0GB
Activations:            ~4.0GB
Optimizer states:       ~3.0GB
Generations (4 × 4096): ~8.0GB
------------------------------
Estimated total:       ~20.5GB
Headroom:               ~3.5GB ✅

Should fit. If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first.
Step 3.7: Training Execution
# Pre-flight checklist:
# ✅ Benchmark built and baselines recorded (Phase 1)
# ✅ 1000+ prompts prepared and validated
# ✅ Config changes applied (temperature, completion length, generations, LR)
# ✅ Entropy monitor callback added
# ✅ Zero-advantage filtering active
# ✅ Reward function v3 with staged components
# ✅ FRESH=True (nuke old checkpoints)
# ✅ W&B run name: grpo-v3-l4-{timestamp}

# Expected runtime: 500 steps × ~5 min/step (longer completions) ≈ 42 hours
# Checkpoints every 10 steps
# Early stopping patience: 15 evals (150 steps)
Step 3.8: Post-Training Validation
- Run the Phase 1 benchmark on the GRPO v3 best checkpoint
- Compare against all baselines: Qwen3-3.7B base → Tucano2-SFT → Tucano2-GRPO-v2 → Tucano2-GRPO-v3
- If v3 > v2 on the benchmark: save as the production model
- If v3 ≈ v2: entropy collapse is not fully fixed; consider switching to MC-GRPO or upgrading the GPU for longer completions
- If v3 < v2: roll back and investigate; a reward function regression is the most likely cause
Decision Criteria for Stopping
| Condition | Action |
|---|---|
| v3 eval reward > 0.20 AND extraction score > 0.40 | Ship it: significantly better than v2 |
| v3 eval reward 0.15-0.20, improving trend | Run epoch 2 (extend MAX_STEPS to 1000) |
| v3 eval reward < v2 (0.125) | Stop, diagnose, review the reward function and data |
| Entropy collapse again (clip_ratio=0 after step 50) | Add entropy bonus via custom loss (requires a TRL fork) |
| OOM | Reduce MAX_COMPLETION_LENGTH to 3072 → 2560 → 2048 |
Appendix: Literature References for Each Fix
| Fix | Paper | Section | Key Finding |
|---|---|---|---|
| Temperature=1.0 | Skywork-OR1 (2505.22312) | §4, Table 3 | τ=1.0 gives 5-8% better performance, delays entropy collapse |
| β=0 (no KL) | Dr. GRPO (2503.20783) | §3.2 | KL penalty unnecessary for rule-based rewards |
| scale_rewards=False | Dr. GRPO (2503.20783) | §3.1 | Std normalization biases toward low-variance groups |
| Longer completions | Dr. GRPO (2503.20783) | §3.1 | GRPO length bias inflates wrong answers when the ceiling is hit |
| Zero-advantage filtering | Skywork-OR1 (2505.22312) | §3.1 | Zero-std groups destabilize training |
| Staged rewards | Reasoning-SQL (2503.23157) | §3.2 | Format rewards converge first, enabling task learning |
| General data mixing | Cocktail Effect (2410.01109) | §4 | 30% general data improves domain performance 2-15% |
| G=4 with median | MC-GRPO (2601.22582) | §3 | Median baseline reduces noise at small group sizes |