# ADR-001: Tucano2-Commerce Next Steps – Detailed Execution Plans


**Status:** Accepted
**Date:** 2025-04-23
**Author:** Rafael Ferraz
**Context:** GRPO v2 training completed (210/300 steps, early stopped). Mean validation reward 0.54 (+42% vs SFT baseline). Three critical issues diagnosed: entropy collapse, completion length ceiling, data scale. This ADR details the execution plan for the next phase.


---


## Phase 1: Build the Domain Benchmark


**Goal:** Create a rigorous, reproducible evaluation suite that measures Tucano2 across all 5 task types.
**Time estimate:** 1-2 days
**Prerequisite:** None – can start immediately


### Step 1.1: Design the Benchmark Prompts


Create **80 held-out prompts** (never seen in training), stratified by task:


| Task | Count | Source | Difficulty Mix |
|------|-------|--------|---------------|
| Structured Extraction | 20 | Real reviews from Olist/B2W datasets | 10 easy (clear sentiment) + 10 hard (mixed/ambiguous) |
| Sentiment Analysis | 15 | Real reviews, balanced pos/neg/neutral | 5 per polarity |
| SQL Generation | 15 | Business questions against your e-commerce schema | 5 simple SELECT + 5 JOIN + 5 aggregate/window |
| Churn/Risk Prediction | 15 | Customer profiles with known outcomes | 5 low-risk + 5 medium + 5 high-risk |
| Business Insights | 15 | Open-ended analytical questions | 5 comparison + 5 trend + 5 recommendation |


**Implementation:**


```python
# File: benchmark/prompts.jsonl
# Each line is a JSON object:
{
  "id": "ext-001",
  "task": "extraction",
  "difficulty": "hard",
  "prompt": "Analise esta avaliação...",
  "system": "<your system prompt>",
  "ground_truth": {
    "sentiment": "negativo",
    "sentiment_score": -0.6,
    "churn_risk": 0.8,
    ...
  },
  "notes": "Mixed sentiment – product good but delivery terrible"
}
```


**Key principles:**
- Include edge cases: mixed sentiment, sarcasm, regional slang ("barato saiu caro"), incomplete reviews
- SQL prompts must have executable ground truth queries against your actual schema (a validation sketch follows this list)
- Insights prompts should require multi-step reasoning, not one-hop lookups
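
To enforce the SQL principle above, it can help to sanity-check every ground-truth query before it lands in `prompts.jsonl`. A minimal sketch, assuming a SQLite copy of the e-commerce schema at `benchmark/ecommerce.db` (hypothetical path) and that SQL items store their query under `ground_truth.sql`:

```python
# File: benchmark/validate_sql_ground_truth.py (hypothetical helper)
import json
import sqlite3

def validate_sql_prompts(prompts_path="benchmark/prompts.jsonl",
                         db_path="benchmark/ecommerce.db"):
    """Fail loudly if any SQL ground-truth query does not execute."""
    conn = sqlite3.connect(db_path)
    failures = []
    for line in open(prompts_path, encoding="utf-8"):
        item = json.loads(line)
        if item["task"] != "sql":
            continue
        try:
            conn.execute(item["ground_truth"]["sql"]).fetchall()
        except sqlite3.Error as exc:
            failures.append((item["id"], str(exc)))
    conn.close()
    if failures:
        raise ValueError(f"{len(failures)} ground-truth queries failed: {failures}")

if __name__ == "__main__":
    validate_sql_prompts()
```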


### Step 1.2: Build Automated Scoring Functions


Each task type gets its own scorer:


#### Extraction Scorer
```python
def score_extraction(pred: dict, gt: dict) -> dict:
    """Score each of the 10 JSON fields independently."""
    scores = {}

    # Categorical fields: exact match
    for field in ["sentiment", "complaint_category"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Numeric fields: distance-based
    for field in ["sentiment_score", "churn_risk", "repeat_intent"]:
        scores[field] = max(0.0, 1.0 - abs(pred[field] - gt[field]))

    # Boolean fields: exact match
    for field in ["delivery_issue", "product_issue", "seller_issue", "would_recommend"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Text fields: semantic similarity (sentence-transformers; helpers sketched below)
    for field in ["main_complaint"]:
        scores[field] = cosine_similarity(embed(pred[field]), embed(gt[field]))

    scores["mean"] = sum(scores.values()) / len(scores)
    return scores
```
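
The scorer above assumes `embed` and `cosine_similarity` helpers. One way to provide them, as a sketch using sentence-transformers (the model name is an assumption; any multilingual/PT-BR-capable embedder works):

```python
# Sketch of the embedding helpers assumed by score_extraction.
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def embed(text: str):
    return _embedder.encode(text, convert_to_tensor=True)

def cosine_similarity(a, b) -> float:
    # util.cos_sim returns a 1x1 tensor for two single vectors
    return float(util.cos_sim(a, b).item())
```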


#### SQL Scorer
```python
def score_sql(predicted_sql: str, ground_truth_sql: str, db_connection) -> dict:
    """Execute both queries and compare result sets."""
    try:
        pred_results = db_connection.execute(predicted_sql).fetchall()
        gt_results = db_connection.execute(ground_truth_sql).fetchall()

        # Execution accuracy (EX): do the result sets match?
        ex = 1.0 if set(pred_results) == set(gt_results) else 0.0

        # Partial credit: row overlap
        overlap = len(set(pred_results) & set(gt_results))
        partial = overlap / max(len(gt_results), 1)

        return {"ex": ex, "partial": partial, "syntax_valid": 1.0}
    except Exception:
        return {"ex": 0.0, "partial": 0.0, "syntax_valid": 0.0}
```


#### Sentiment Scorer
```python
def score_sentiment(pred: dict, gt: dict) -> dict:
    """Exact match on polarity, distance on score.

    Both arguments are parsed dicts with "polarity" and "score" (in [-1, 1]) keys.
    """
    polarity_match = 1.0 if pred["polarity"] == gt["polarity"] else 0.0
    score_distance = max(0.0, 1.0 - abs(pred["score"] - gt["score"]) / 2.0)
    return {"polarity": polarity_match, "score": score_distance}
```


#### Churn Scorer
```python
def score_churn(prediction: float, ground_truth: float, threshold=0.5) -> dict:
    """Binary accuracy + calibration."""
    binary = 1.0 if (prediction >= threshold) == (ground_truth >= threshold) else 0.0
    calibration = max(0, 1.0 - abs(prediction - ground_truth))
    return {"binary_accuracy": binary, "calibration": calibration}
```


#### Insights Scorer (LLM-as-Judge)
```python
import json

import openai

JUDGE_PROMPT = """You are evaluating a Brazilian e-commerce analysis response.
Rate on 4 dimensions (1-5 each):
1. Relevance: Does it address the question?
2. Accuracy: Are the claims factually reasonable?
3. Actionability: Could a business act on this analysis?
4. Portuguese Quality: Is the PT-BR natural and professional?

Response to evaluate:
{response}

Question asked:
{question}

Return JSON: {{"relevance": X, "accuracy": X, "actionability": X, "portuguese": X}}"""
# The literal JSON braces are doubled so str.format() only fills {response}/{question}.

def score_insights(response: str, question: str) -> dict:
    """Use GPT-4o as judge, run 3x and average."""
    scores = []
    for _ in range(3):
        result = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                response=response, question=question
            )}],
            temperature=0.3
        )
        scores.append(json.loads(result.choices[0].message.content))

    # Average across 3 runs
    return {k: sum(s[k] for s in scores) / 3 for k in scores[0]}
```


### Step 1.3: Run the Benchmark Script


```python
# File: benchmark/run_benchmark.py
import json

# Assumes the scorers above plus `generate(...)` and `parse_response(...)`
# helpers are importable; SQL scoring additionally needs a live DB connection.

def run_benchmark(model, tokenizer, prompts_path, output_path, db_connection=None):
    prompts = [json.loads(line) for line in open(prompts_path, encoding="utf-8")]
    results = []

    for prompt in prompts:
        # Generate response
        messages = [
            {"role": "system", "content": prompt["system"]},
            {"role": "user", "content": prompt["prompt"]}
        ]
        response = generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1)

        # Score based on task type (SQL and insights scorers take different arguments)
        if prompt["task"] == "sql":
            score = score_sql(parse_response(response), prompt["ground_truth"]["sql"], db_connection)
        elif prompt["task"] == "insights":
            score = score_insights(response, prompt["prompt"])
        else:
            score = SCORERS[prompt["task"]](parse_response(response), prompt.get("ground_truth"))

        results.append({
            "id": prompt["id"],
            "task": prompt["task"],
            "difficulty": prompt["difficulty"],
            "response": response,
            "scores": score,
            "tokens": len(tokenizer.encode(response))
        })

    # Save results
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Print summary (mean over each scorer's sub-scores, since score keys differ per task)
    for task in ["extraction", "sentiment", "sql", "churn", "insights"]:
        task_results = [r for r in results if r["task"] == task]
        task_means = [sum(r["scores"].values()) / len(r["scores"]) for r in task_results]
        print(f"{task}: {sum(task_means) / len(task_means):.3f} (n={len(task_results)})")

SCORERS = {
    "extraction": score_extraction,
    "sentiment": score_sentiment,
    "sql": score_sql,
    "churn": score_churn,
    "insights": score_insights,
}
```


### Step 1.4: Establish Baselines


Run the benchmark on:
1. **Qwen3-3.7B base** (zero-shot, no adapters) – this is your floor
2. **Tucano2-SFT** (SFT adapter only) – this is your pre-GRPO baseline
3. **Tucano2-GRPO v2** (current best) – this is what you're improving


Record all results in a comparison table.
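
A minimal driver for the three local baselines, assuming the adapters live under `models/` and a SQLite copy of the schema from Step 1.1 (both paths are assumptions), and reusing `run_benchmark` from Step 1.3:

```python
# Sketch: run the benchmark against each local baseline.
# Adapter paths and the base repo name are assumptions; adjust to your setup.
import sqlite3

import torch
from unsloth import FastLanguageModel

db = sqlite3.connect("benchmark/ecommerce.db")  # used by the SQL scorer

BASELINES = {
    "qwen3-base": "Qwen/Qwen3-4B",               # stand-in for the Qwen3 base used in SFT
    "tucano2-sft": "models/tucano2-sft",          # SFT LoRA checkpoint dir
    "tucano2-grpo-v2": "models/tucano2-grpo-v2",  # GRPO v2 checkpoint dir
}

for name, path in BASELINES.items():
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=path,
        max_seq_length=4096,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)
    run_benchmark(model, tokenizer, "benchmark/prompts.jsonl",
                  f"results/{name}.json", db_connection=db)
    del model
    torch.cuda.empty_cache()  # free VRAM before loading the next baseline
```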


---


## Phase 2: Run Comparison vs. Qwen3-35B-A3B


**Goal:** Prove domain-tuned 3.7B matches or beats general 35B on e-commerce tasks.
**Time estimate:** 1 day (after benchmark is built)
**Prerequisite:** Phase 1 complete


### Step 2.1: Set Up Qwen3-35B-A3B


This is a Mixture-of-Experts model (35B total, ~3B active per token). It should fit on your L4 with 4-bit quantization.


```python
from unsloth import FastLanguageModel

model_35b, tokenizer_35b = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-35B-A3B",  # Verify exact HF repo name
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
FastLanguageModel.for_inference(model_35b)
```


**Memory estimate:** 35B params × 0.5 bytes (4-bit) ≈ 17.5GB. Active inference ~3B params ≈ 3GB compute. Should fit on L4 (24GB) with headroom.


**If it doesn't fit:** Use `transformers` with `BitsAndBytesConfig(load_in_4bit=True)` instead of Unsloth, or try GGUF via `llama.cpp`.
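
A sketch of the `transformers` fallback mentioned above (the repo name still needs verifying, as in the Unsloth snippet):

```python
# Fallback if Unsloth cannot load the MoE model: plain transformers + bitsandbytes NF4.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_35b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-35B-A3B",          # verify exact HF repo name
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer_35b = AutoTokenizer.from_pretrained("Qwen/Qwen3-35B-A3B")
```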


### Step 2.2: Run the Same Benchmark


```python
# Run identical benchmark on Qwen3-35B-A3B
run_benchmark(model_35b, tokenizer_35b, "benchmark/prompts.jsonl", "results/qwen3-35b.json")
```


**Critical:** Use the same system prompt, same temperature (0.1 for eval), same max_new_tokens. The only variable should be the model.


### Step 2.3: Optional – Add GPT-4o Baseline


```python
# For the strongest reference point
import json

import openai

def run_benchmark_api(prompts_path, output_path, model="gpt-4o"):
    prompts = [json.loads(line) for line in open(prompts_path, encoding="utf-8")]
    results = []

    for prompt in prompts:
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt["system"]},
                {"role": "user", "content": prompt["prompt"]}
            ],
            temperature=0.1
        )
        # ... score same as above
```


### Step 2.4: Build the Comparison Report


| Task | Qwen3-3.7B (base) | Tucano2-SFT | Tucano2-GRPO v2 | Qwen3-35B (zero-shot) | GPT-4o (zero-shot) |
|------|-------------------|-------------|-----------------|-----------------------|--------------------|
| Extraction | | | | | |
| Sentiment | | | | | |
| SQL | | | | | |
| Churn | | | | | |
| Insights | | | | | |
| **MEAN** | | | | | |
| Cost/query | $0.001 | $0.001 | $0.001 | $0.003 | $0.010 |
| Latency (s) | | | | | |
| Privacy | ✅ On-prem | ✅ On-prem | ✅ On-prem | ✅ On-prem | ❌ API |


**The story to tell:**
- If Tucano2-GRPO ≥ Qwen3-35B on extraction/sentiment/SQL: "Domain tuning eliminates the need for 10× larger models"
- If Tucano2-GRPO < Qwen3-35B on insights: "Open-ended reasoning still benefits from scale, but structured tasks don't"
- Cost column proves the business case regardless of performance parity


---


## Phase 3: GRPO v3 Training Run


**Goal:** Fix entropy collapse, break through the v2 performance plateau.
**Time estimate:** 2-3 days (including data prep)
**Prerequisite:** Phase 1 (benchmark needed to measure improvement)


### Step 3.1: Expand Training Data (1000+ prompts)


**Target: 1000 prompts** (up from 300), stratified:


| Task | Current | Target | How to Expand |
|------|---------|--------|--------------|
| Extraction | ~75 | 300 | Generate from the Olist dataset – sample reviews, create ground truth JSON |
| Sentiment | ~60 | 200 | Sample from B2W reviews corpus, label with existing model + human review |
| SQL | ~75 | 250 | Template-based: vary table names, WHERE clauses, aggregation patterns (see the sketch after the recipe below) |
| Churn | ~45 | 100 | Augment customer profiles with synthetic variations |
| Insights | ~45 | 150 | LLM-generated analytical questions about e-commerce scenarios |


**Synthetic generation recipe (from the Cocktail Effect paper):**
```python
# Use your SFT model or GPT-4o to generate new training prompts,
# then manually verify/correct the ground truth.
# (format_extraction_prompt and gpt4o_extract are assumed project helpers.)
def generate_extraction_prompts(reviews_df, n=200):
    prompts = []
    for _, row in reviews_df.sample(n).iterrows():
        prompt = format_extraction_prompt(row["review_text"], row["score"], row["status"])
        # Generate ground truth with GPT-4o (higher quality than self-labeling)
        ground_truth = gpt4o_extract(prompt)
        prompts.append({"prompt": prompt, "ground_truth": ground_truth})
    return prompts
```
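
For the SQL row of the expansion table, template-based generation can be as simple as filling slots; a minimal sketch, with the table and column names as placeholders for your actual e-commerce schema:

```python
# Sketch: template-based SQL prompt expansion.
# Table/column names below are placeholders, not the real schema.
import itertools
import random

TEMPLATES = [
    ("Qual o total de pedidos por {dim}?",
     "SELECT {dim}, COUNT(*) FROM orders GROUP BY {dim};"),
    ("Qual a receita média por {dim}?",
     "SELECT {dim}, AVG(order_value) FROM orders GROUP BY {dim};"),
]
DIMENSIONS = ["customer_state", "product_category", "payment_type"]

def generate_sql_prompts(n=250, seed=42):
    random.seed(seed)
    combos = list(itertools.product(TEMPLATES, DIMENSIONS))
    prompts = []
    for (question, sql), dim in random.choices(combos, k=n):
        prompts.append({
            "prompt": question.format(dim=dim),
            "ground_truth": {"sql": sql.format(dim=dim)},
        })
    return prompts
```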


**Also add 30% general reasoning data** (Cocktail Effect paper finding):
- Source: Portuguese subset of Orca-Math or translated OpenOrca
- Purpose: Regularization – prevents the model from overfitting to domain patterns
- Mix ratio: 700 domain + 300 general = 1000 total (see the mixing sketch below)
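
A sketch of the 700/300 mix, assuming both prompt pools already exist as JSONL files (paths are assumptions):

```python
# Sketch: build the 70/30 domain/general training mix.
import json
import random

def build_mix(domain_path="data/domain_prompts.jsonl",
              general_path="data/general_reasoning_pt.jsonl",
              n_domain=700, n_general=300, seed=42):
    random.seed(seed)
    domain = [json.loads(l) for l in open(domain_path, encoding="utf-8")]
    general = [json.loads(l) for l in open(general_path, encoding="utf-8")]
    mix = random.sample(domain, n_domain) + random.sample(general, n_general)
    random.shuffle(mix)
    return mix
```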


### Step 3.2: Configuration Changes for v3


```python
# === GRPO v3 Config Changes ===

# FIX 1: Temperature – prevent entropy collapse
TEMPERATURE = 1.0  # Was 0.8 in v2. The GRPO papers cited in the appendix use 1.0.
# Reference: Skywork-OR1 (2505.22312) ablation shows τ=1.0 >> τ=0.6

# FIX 2: Completion length – remove the ceiling
MAX_COMPLETION_LENGTH = 4096  # Was 2048. Every v2 completion hit the ceiling.
# Trade-off: halve num_generations to fit VRAM

# FIX 3: Reduce generations to fit longer completions
NUM_GENERATIONS = 4  # Was 8. 4 × 4096 ≈ 8 × 2048 in VRAM terms.
# MC-GRPO paper shows G=4 can work if using a median baseline

# FIX 4: Explicit β=0 (no KL penalty)
# Dr. GRPO paper: β=0 is optimal for rule-based rewards
BETA = 0.0

# FIX 5: Learning rate – slightly more aggressive
LEARNING_RATE = 3e-6  # Was 2e-6. Clip ratios were all 0 → room to push harder.
# Still well within the published range (1e-6 to 5e-6)

# FIX 6: Max steps for expanded data
# 1000 prompts × 4 generations / (4 batch × 2 accum) = 500 steps/epoch
MAX_STEPS = 500

# FIX 7: Early stopping – more generous
EARLY_STOPPING_PATIENCE = 15  # 15 evals × 10 steps = 150 steps of runway
EVAL_STEPS = 10
SAVE_STEPS = 10
```
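
Wiring these constants into TRL should be mechanical; a sketch assuming the field names exposed by `GRPOConfig`/`GRPOTrainer` in TRL 0.24 (verify against your installed version):

```python
# Sketch: map the v3 constants onto TRL's GRPOConfig.
# Batch size / accumulation values are assumptions that match the step math above.
from trl import GRPOConfig, GRPOTrainer

grpo_config = GRPOConfig(
    output_dir="outputs/grpo-v3",
    temperature=TEMPERATURE,                      # FIX 1
    max_completion_length=MAX_COMPLETION_LENGTH,  # FIX 2
    num_generations=NUM_GENERATIONS,              # FIX 3
    beta=BETA,                                    # FIX 4
    learning_rate=LEARNING_RATE,                  # FIX 5
    max_steps=MAX_STEPS,                          # FIX 6
    eval_strategy="steps",                        # FIX 7
    eval_steps=EVAL_STEPS,
    save_steps=SAVE_STEPS,
    scale_rewards=False,                          # Dr. GRPO recommendation (see appendix)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
)

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    train_dataset=train_dataset,               # the 1000-prompt mix from Step 3.1
    reward_funcs=filtered_commerce_reward_fn,  # wrapper from Step 3.4
)
```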


### Step 3.3: Add Entropy Monitoring (Critical for v3)


Since TRL 0.24.0 doesn't expose a native entropy bonus, implement the monitoring via a callback:


```python
import wandb
from transformers import TrainerCallback

class EntropyMonitorCallback(TrainerCallback):
    """Monitor policy entropy to detect collapse early."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        # The log key below should match what your GRPOTrainer actually emits.
        if logs and "train/completion_length" in logs:
            # Proxy for entropy: if all completions are max length,
            # entropy is collapsing (model is stuck in one mode)
            completion_ratio = logs["train/completion_length"] / MAX_COMPLETION_LENGTH

            if completion_ratio > 0.95:
                print(f"⚠️ Step {state.global_step}: Completion ratio {completion_ratio:.2f} – "
                      f"possible entropy collapse. Monitor reward_std.")

            # Log to W&B
            if wandb.run:
                wandb.log({
                    "monitor/completion_ratio": completion_ratio,
                    "monitor/entropy_proxy": 1.0 - completion_ratio,
                }, step=state.global_step)
```
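
Register the callback when constructing the trainer, for example `trainer.add_callback(EntropyMonitorCallback())` or via the `callbacks=[...]` constructor argument; otherwise `on_log` never fires.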


### Step 3.4: Add Zero-Advantage Group Filtering


```python
import random

# In the reward function wrapper or custom trainer:
def filtered_commerce_reward_fn(completions, prompts, **kwargs):
    """Compute rewards and break ties in zero-variance groups."""
    rewards = commerce_reward_fn(completions, prompts, **kwargs)

    # Group rewards by prompt (each prompt has NUM_GENERATIONS completions)
    for i in range(0, len(rewards), NUM_GENERATIONS):
        group = rewards[i:i+NUM_GENERATIONS]
        if max(group) - min(group) < 0.01:  # Near-zero variance
            # Add small noise to break the tie.
            # This keeps the group's reward std from collapsing to ~0, which would
            # blow up the normalized GRPO advantages.
            for j in range(i, i+NUM_GENERATIONS):
                rewards[j] += random.gauss(0, 0.005)

    return rewards
```


**Note:** The proper fix is MC-GRPO's median baseline, but that requires modifying TRL internals. The noise injection is a pragmatic workaround for TRL 0.24.0.


### Step 3.5: Reward Function Refinement


Split the composite reward into staged components (Reasoning-SQL paper finding):


```python
def commerce_reward_fn_v3(completions, prompts, **kwargs):
    """Multi-component reward with staged convergence."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Stage 1: Format reward (converges first)
        r_format = score_format(completion)  # JSON valid? Think tags closed? Right structure?

        # Stage 2: Partial content reward
        r_partial = score_partial_content(completion, prompt)  # Some fields correct?

        # Stage 3: Full task reward
        r_task = score_full_task(completion, prompt)  # All fields correct? SQL executes?

        # Weighted combination – fixed here; see the annealing sketch below for
        # decaying the format weight over training
        reward = 0.2 * r_format + 0.3 * r_partial + 0.5 * r_task
        rewards.append(reward)

    return rewards
```
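
If you want the format weight to actually decay over training rather than stay fixed, one possible schedule is sketched below (hypothetical helper; the current step has to be threaded in from the trainer state or a counter kept in the reward wrapper):

```python
# Sketch: anneal the format weight down as training progresses.
def reward_weights(step: int, max_steps: int = 500):
    """Return (w_format, w_partial, w_task); weights always sum to 1.0."""
    progress = min(step / max_steps, 1.0)
    w_format = 0.4 - 0.3 * progress       # 0.4 -> 0.1
    w_partial = 0.3                       # constant
    w_task = 1.0 - w_format - w_partial   # 0.3 -> 0.6
    return w_format, w_partial, w_task
```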


### Step 3.6: VRAM Budget for v3


```
L4 GPU: 24GB total

Model (NF4):            ~3.5GB
KV Cache (4096 seq):    ~2.0GB
Activations:            ~4.0GB
Optimizer states:       ~3.0GB
Generations (4×4096):   ~8.0GB
────────────────────────────
Estimated total:       ~20.5GB
Headroom:               ~3.5GB

✅ Should fit. If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first.
```


### Step 3.7: Training Execution


```bash
# Pre-flight checklist:
# ✅ Benchmark built and baselines recorded (Phase 1)
# ✅ 1000+ prompts prepared and validated
# ✅ Config changes applied (temperature, completion length, generations, LR)
# ✅ Entropy monitor callback added
# ✅ Zero-advantage filtering active
# ✅ Reward function v3 with staged components
# ✅ FRESH=True (nuke old checkpoints)
# ✅ W&B run name: grpo-v3-l4-{timestamp}

# Expected runtime: 500 steps × ~5 min/step (longer completions) ≈ 42 hours
# Checkpoints every 10 steps
# Early stopping patience: 15 evals (150 steps)
```


### Step 3.8: Post-Training Validation


1. Run the Phase 1 benchmark on the GRPO v3 best checkpoint
2. Compare against all baselines (a comparison sketch follows this list):
   - Qwen3-3.7B base → Tucano2-SFT → Tucano2-GRPO-v2 → **Tucano2-GRPO-v3**
3. If v3 > v2 on the benchmark: save as the production model
4. If v3 ≈ v2: entropy collapse not fully fixed; consider switching to MC-GRPO or upgrading the GPU for longer completions
5. If v3 < v2: roll back and investigate – most likely a reward function regression
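
A small sketch for the baseline comparison in step 2, assuming each benchmark run was saved under `results/` with the names used earlier (`tucano2-grpo-v3.json` being the new file):

```python
# Sketch: per-task comparison across saved benchmark result files.
import json

RUNS = ["qwen3-base", "tucano2-sft", "tucano2-grpo-v2", "tucano2-grpo-v3"]
TASKS = ["extraction", "sentiment", "sql", "churn", "insights"]

def task_mean(results, task):
    """Mean over each item's sub-scores, restricted to one task."""
    per_item = [sum(r["scores"].values()) / len(r["scores"])
                for r in results if r["task"] == task]
    return sum(per_item) / len(per_item) if per_item else float("nan")

for task in TASKS:
    cells = []
    for run in RUNS:
        with open(f"results/{run}.json", encoding="utf-8") as f:
            results = json.load(f)
        cells.append(f"{run}={task_mean(results, task):.3f}")
    print(f"{task:12s} " + "  ".join(cells))
```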


---


## Decision Criteria for Stopping


| Condition | Action |
|-----------|--------|
| v3 eval reward > 0.20 AND extraction score > 0.40 | Ship it – significantly better than v2 |
| v3 eval reward 0.15-0.20, improving trend | Run epoch 2 (extend MAX_STEPS to 1000) |
| v3 eval reward < v2 (0.125) | Stop, diagnose, review reward function and data |
| Entropy collapse again (clip_ratio=0 after step 50) | Add entropy bonus via custom loss (requires TRL fork) |
| OOM | Reduce MAX_COMPLETION_LENGTH to 3072 → 2560 → 2048 |


---


## Appendix: Literature References for Each Fix


| Fix | Paper | Section | Key Finding |
|-----|-------|---------|-------------|
| Temperature=1.0 | Skywork-OR1 (2505.22312) | §4, Table 3 | τ=1.0 gives 5-8% better performance, delays entropy collapse |
| β=0 (no KL) | Dr. GRPO (2503.20783) | §3.2 | KL penalty unnecessary for rule-based rewards |
| scale_rewards=False | Dr. GRPO (2503.20783) | §3.1 | Std normalization biases toward low-variance groups |
| Longer completions | Dr. GRPO (2503.20783) | §3.1 | GRPO length bias inflates wrong answers → ceiling hit |
| Zero-advantage filtering | Skywork-OR1 (2505.22312) | §3.1 | Zero-std groups destabilize training |
| Staged rewards | Reasoning-SQL (2503.23157) | §3.2 | Format rewards converge first, enable task learning |
| General data mixing | Cocktail Effect (2410.01109) | §4 | 30% general data improves domain performance 2-15% |
| G=4 with median | MC-GRPO (2601.22582) | §3 | Median baseline reduces noise at small group sizes |
