
ADR-001: Tucano2-Commerce Next Steps - Detailed Execution Plans

Status: Accepted
Date: 2025-04-23
Author: Rafael Ferraz
Context: GRPO v2 training completed (210/300 steps, early stopped). Mean validation reward 0.54 (+42% vs SFT baseline). Three critical issues diagnosed: entropy collapse, completion length ceiling, data scale. This ADR details the execution plan for the next phase.


Phase 1: Build the Domain Benchmark

Goal: Create a rigorous, reproducible evaluation suite that measures Tucano2 across all 5 task types.
Time estimate: 1-2 days
Prerequisite: None (can start immediately)

Step 1.1: Design the Benchmark Prompts

Create 80 held-out prompts (never seen in training), stratified by task:

| Task | Count | Source | Difficulty Mix |
|------|-------|--------|----------------|
| Structured Extraction | 20 | Real reviews from Olist/B2W datasets | 10 easy (clear sentiment) + 10 hard (mixed/ambiguous) |
| Sentiment Analysis | 15 | Real reviews, balanced pos/neg/neutral | 5 per polarity |
| SQL Generation | 15 | Business questions against your e-commerce schema | 5 simple SELECT + 5 JOIN + 5 aggregate/window |
| Churn/Risk Prediction | 15 | Customer profiles with known outcomes | 5 low-risk + 5 medium + 5 high-risk |
| Business Insights | 15 | Open-ended analytical questions | 5 comparison + 5 trend + 5 recommendation |

Implementation:

# File: benchmark/prompts.jsonl
# Each line is a JSON object:
{
    "id": "ext-001",
    "task": "extraction",
    "difficulty": "hard",
    "prompt": "Analise esta avaliação...",
    "system": "<your system prompt>",
    "ground_truth": {
        "sentiment": "negativo",
        "sentiment_score": -0.6,
        "churn_risk": 0.8,
        ...
    },
    "notes": "Mixed sentiment: product good but delivery terrible"
}

Key principles:

  • Include edge cases: mixed sentiment, sarcasm, regional slang ("barato saiu caro"), incomplete reviews
  • SQL prompts must have executable ground truth queries against your actual schema
  • Insights prompts should require multi-step reasoning, not one-hop lookups
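Before any model sees the benchmark, it is worth enforcing this stratification mechanically. The helper below is a sketch (`check_stratification` and `EXPECTED_COUNTS` are hypothetical, not part of the repo); the expected counts come straight from the table above.

```python
# Hypothetical pre-flight check for benchmark/prompts.jsonl; the expected
# per-task counts mirror the stratification table (80 prompts total).
import json
from collections import Counter

EXPECTED_COUNTS = {"extraction": 20, "sentiment": 15, "sql": 15,
                   "churn": 15, "insights": 15}

def check_stratification(prompts_path: str) -> Counter:
    """Fail loudly if the benchmark drifts from the planned task mix."""
    with open(prompts_path) as f:
        prompts = [json.loads(line) for line in f if line.strip()]
    ids = [p["id"] for p in prompts]
    assert len(ids) == len(set(ids)), "duplicate prompt ids"
    counts = Counter(p["task"] for p in prompts)
    for task, expected in EXPECTED_COUNTS.items():
        assert counts[task] == expected, f"{task}: got {counts[task]}, want {expected}"
    return counts
```

Running this in CI keeps later prompt edits from silently skewing the task mix.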

Step 1.2: Build Automated Scoring Functions

Each task type gets its own scorer:

Extraction Scorer

def score_extraction(prediction: dict, ground_truth: dict) -> dict:
    """Score each of the 10 JSON fields independently."""
    pred, gt = prediction, ground_truth  # short aliases used below
    scores = {}

    # Categorical fields: exact match
    for field in ["sentiment", "complaint_category"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Numeric fields: distance-based
    for field in ["sentiment_score", "churn_risk", "repeat_intent"]:
        scores[field] = max(0, 1.0 - abs(pred[field] - gt[field]))

    # Boolean fields: exact match
    for field in ["delivery_issue", "product_issue", "seller_issue", "would_recommend"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Text fields: semantic similarity (embed/cosine_similarity are
    # sentence-transformers helpers)
    for field in ["main_complaint"]:
        scores[field] = cosine_similarity(embed(pred[field]), embed(gt[field]))

    scores["mean"] = sum(scores.values()) / len(scores)
    return scores

SQL Scorer

def score_sql(predicted_sql: str, ground_truth_sql: str, db_connection) -> dict:
    """Execute both queries and compare result sets."""
    try:
        pred_results = db_connection.execute(predicted_sql).fetchall()
        gt_results = db_connection.execute(ground_truth_sql).fetchall()
        
        # Execution accuracy (EX): do the result sets match?
        ex = 1.0 if set(pred_results) == set(gt_results) else 0.0
        
        # Partial credit: row overlap
        overlap = len(set(pred_results) & set(gt_results))
        partial = overlap / max(len(gt_results), 1)
        
        return {"ex": ex, "partial": partial, "syntax_valid": 1.0}
    except Exception:
        return {"ex": 0.0, "partial": 0.0, "syntax_valid": 0.0}

Sentiment Scorer

def score_sentiment(prediction: dict, ground_truth: dict) -> dict:
    """Exact match on polarity, distance on score (scores range -1 to 1)."""
    polarity_match = 1.0 if prediction["polarity"] == ground_truth["polarity"] else 0.0
    score_distance = max(0, 1.0 - abs(prediction["score"] - ground_truth["score"]) / 2.0)
    return {"polarity": polarity_match, "score": score_distance}

Churn Scorer

def score_churn(prediction: float, ground_truth: float, threshold=0.5) -> dict:
    """Binary accuracy + calibration."""
    binary = 1.0 if (prediction >= threshold) == (ground_truth >= threshold) else 0.0
    calibration = max(0, 1.0 - abs(prediction - ground_truth))
    return {"binary_accuracy": binary, "calibration": calibration}

Insights Scorer (LLM-as-Judge)

# Literal braces in the JSON template are doubled so str.format only
# fills {response} and {question}.
JUDGE_PROMPT = """You are evaluating a Brazilian e-commerce analysis response.
Rate on 4 dimensions (1-5 each):
1. Relevance: Does it address the question?
2. Accuracy: Are the claims factually reasonable?
3. Actionability: Could a business act on this analysis?
4. Portuguese Quality: Is the PT-BR natural and professional?

Response to evaluate:
{response}

Question asked:
{question}

Return JSON: {{"relevance": X, "accuracy": X, "actionability": X, "portuguese": X}}"""

import json

import openai

def score_insights(response: str, question: str) -> dict:
    """Use GPT-4o as judge; run 3x and average to smooth judge variance."""
    scores = []
    for _ in range(3):
        result = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                response=response, question=question
            )}],
            response_format={"type": "json_object"},  # guarantees parseable JSON
            temperature=0.3
        )
        scores.append(json.loads(result.choices[0].message.content))

    # Average across the 3 runs
    return {k: sum(s[k] for s in scores) / 3 for k in scores[0]}

Step 1.3: Run the Benchmark Script

# File: benchmark/run_benchmark.py
import json
from pathlib import Path

def run_benchmark(model, tokenizer, prompts_path, output_path):
    prompts = [json.loads(line) for line in open(prompts_path)]
    results = []
    
    for prompt in prompts:
        # Generate response
        messages = [
            {"role": "system", "content": prompt["system"]},
            {"role": "user", "content": prompt["prompt"]}
        ]
        response = generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1)
        
        # Score based on task type. Note: score_sql additionally needs a
        # db_connection argument, so the SQL branch needs its own dispatch.
        scorer = SCORERS[prompt["task"]]
        score = scorer(parse_response(response), prompt.get("ground_truth"))
        
        results.append({
            "id": prompt["id"],
            "task": prompt["task"],
            "difficulty": prompt["difficulty"],
            "response": response,
            "scores": score,
            "tokens": len(tokenizer.encode(response))
        })
    
    # Save results
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    # Print summary. Not every scorer returns a "mean" key, so average
    # whatever numeric sub-scores each result carries.
    for task in ["extraction", "sentiment", "sql", "churn", "insights"]:
        task_results = [r for r in results if r["task"] == task]
        per_result = [sum(r["scores"].values()) / len(r["scores"]) for r in task_results]
        mean_score = sum(per_result) / max(len(per_result), 1)
        print(f"{task}: {mean_score:.3f} (n={len(task_results)})")

SCORERS = {
    "extraction": score_extraction,
    "sentiment": score_sentiment,
    "sql": score_sql,
    "churn": score_churn,
    "insights": score_insights,
}
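The runner above leans on a `parse_response` helper the document never defines. A minimal sketch, assuming completions may carry `<think>` tags and that JSON tasks emit one object (SQL and insights tasks fall through to raw text):

```python
# Hypothetical parse_response: strip any <think>...</think> block, then
# return the first balanced, parseable JSON object; otherwise raw text.
import json
import re

def parse_response(completion: str):
    """Extract the first JSON object from a completion, else return the text."""
    text = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    start = text.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:  # candidate object closed; try to parse it
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break
        start = text.find("{", start + 1)
    return text.strip()  # SQL / free-text tasks: hand back the raw completion
```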

Step 1.4: Establish Baselines

Run the benchmark on:

  1. Qwen3-3.7B base (zero-shot, no adapters): this is your floor
  2. Tucano2-SFT (SFT adapter only): this is your pre-GRPO baseline
  3. Tucano2-GRPO v2 (current best): this is what you're improving

Record all results in a comparison table.
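To fill that table, a small aggregation helper (hypothetical; `summarize` is not in the repo) can reduce each run's results list, as produced by `run_benchmark`, to per-task means:

```python
# Hypothetical aggregator: {model_name: [result dicts]} -> per-task means.
from statistics import mean

def summarize(results_by_model: dict) -> dict:
    """Collapse run_benchmark-style results into comparison-table rows."""
    table = {}
    for model, results in results_by_model.items():
        by_task = {}
        for r in results:
            # Average all numeric sub-scores so every scorer is comparable
            vals = [v for v in r["scores"].values() if isinstance(v, (int, float))]
            by_task.setdefault(r["task"], []).append(mean(vals))
        table[model] = {task: round(mean(scores), 3) for task, scores in by_task.items()}
    return table
```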


Phase 2: Run Comparison vs. Qwen3-35B-A3B

Goal: Prove domain-tuned 3.7B matches or beats general 35B on e-commerce tasks.
Time estimate: 1 day (after benchmark is built)
Prerequisite: Phase 1 complete

Step 2.1: Set Up Qwen3-35B-A3B

This is a Mixture-of-Experts model (35B total, ~3B active per token). It should fit on your L4 with 4-bit quantization.

from unsloth import FastLanguageModel

model_35b, tokenizer_35b = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-35B-A3B",  # Verify exact HF repo name
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
FastLanguageModel.for_inference(model_35b)

Memory estimate: 35B params × 0.5 bytes (4-bit) ≈ 17.5GB. Active inference ~3B ≈ 3GB compute. Should fit on L4 (24GB) with headroom.

If it doesn't fit: Use transformers with BitsAndBytesConfig(load_in_4bit=True) instead of Unsloth, or try GGUF via llama.cpp.
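A sketch of that transformers fallback, using standard bitsandbytes NF4 settings (the repo name is still the unverified one from above):

```python
# Fallback if the Unsloth path doesn't fit: plain transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4, same family Unsloth uses
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
    bnb_4bit_use_double_quant=True,         # shaves a bit more off the weights
)
model_35b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-35B-A3B",  # verify exact HF repo name, as noted above
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer_35b = AutoTokenizer.from_pretrained("Qwen/Qwen3-35B-A3B")
```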

Step 2.2: Run the Same Benchmark

# Run identical benchmark on Qwen3-35B-A3B
run_benchmark(model_35b, tokenizer_35b, "benchmark/prompts.jsonl", "results/qwen3-35b.json")

Critical: Use the same system prompt, same temperature (0.1 for eval), same max_new_tokens. The only variable should be the model.

Step 2.3 (Optional): Add GPT-4o Baseline

# For the strongest reference point
import json

import openai

def run_benchmark_api(prompts_path, output_path, model="gpt-4o"):
    prompts = [json.loads(line) for line in open(prompts_path)]
    results = []
    
    for prompt in prompts:
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt["system"]},
                {"role": "user", "content": prompt["prompt"]}
            ],
            temperature=0.1
        )
        # ... score same as above

Step 2.4: Build the Comparison Report

| Task | Qwen3-3.7B (base) | Tucano2-SFT | Tucano2-GRPO v2 | Qwen3-35B (zero-shot) | GPT-4o (zero-shot) |
|------|-------------------|-------------|-----------------|-----------------------|--------------------|
| Extraction | | | | | |
| Sentiment | | | | | |
| SQL | | | | | |
| Churn | | | | | |
| Insights | | | | | |
| **MEAN** | | | | | |
| Cost/query | $0.001 | $0.001 | $0.001 | $0.003 | $0.010 |
| Latency (s) | | | | | |
| Privacy | ✅ On-prem | ✅ On-prem | ✅ On-prem | ✅ On-prem | ❌ API |

The story to tell:

  • If Tucano2-GRPO ≥ Qwen3-35B on extraction/sentiment/SQL: "Domain tuning eliminates the need for 10× larger models"
  • If Tucano2-GRPO < Qwen3-35B on insights: "Open-ended reasoning still benefits from scale, but structured tasks don't"
  • Cost column proves the business case regardless of performance parity

Phase 3: GRPO v3 Training Run

Goal: Fix entropy collapse, break through the v2 performance plateau.
Time estimate: 2-3 days (including data prep)
Prerequisite: Phase 1 (benchmark needed to measure improvement)

Step 3.1: Expand Training Data (1000+ prompts)

Target: 1000 prompts (up from 300), stratified:

| Task | Current | Target | How to Expand |
|------|---------|--------|---------------|
| Extraction | ~75 | 300 | Generate from Olist dataset: sample reviews, create ground truth JSON |
| Sentiment | ~60 | 200 | Sample from B2W reviews corpus, label with existing model + human review |
| SQL | ~75 | 250 | Template-based: vary table names, WHERE clauses, aggregation patterns |
| Churn | ~45 | 100 | Augment customer profiles with synthetic variations |
| Insights | ~45 | 150 | LLM-generated analytical questions about e-commerce scenarios |

Synthetic generation recipe (from Cocktail Effect paper):

# Use your SFT model or GPT-4o to generate new training prompts
# Then manually verify/correct the ground truth
def generate_extraction_prompts(reviews_df, n=200):
    prompts = []
    for _, row in reviews_df.sample(n).iterrows():
        prompt = format_extraction_prompt(row["review_text"], row["score"], row["status"])
        # Generate ground truth with GPT-4o (higher quality than self-labeling)
        ground_truth = gpt4o_extract(prompt)
        prompts.append({"prompt": prompt, "ground_truth": ground_truth})
    return prompts

Also add 30% general reasoning data (Cocktail Effect paper finding):

  • Source: Portuguese subset of Orca-Math or translated OpenOrca
  • Purpose: Regularization; prevents the model from overfitting to domain patterns
  • Mix ratio: 700 domain + 300 general = 1000 total
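The mixing step can be sketched as follows (assumed list-of-dict prompt format as in prompts.jsonl; `mix_datasets` is a hypothetical helper): sample the general pool down to the 30% ratio, tag provenance, and shuffle once with a fixed seed.

```python
# Hypothetical 70/30 data mixer for the v3 training set.
import random

def mix_datasets(domain, general, general_ratio=0.3, seed=42):
    """Return domain + sampled general prompts, shuffled and provenance-tagged."""
    # How many general prompts give general_ratio of the final mix
    n_general = round(len(domain) * general_ratio / (1 - general_ratio))
    rng = random.Random(seed)
    sampled = rng.sample(general, min(n_general, len(general)))
    mixed = ([{**p, "source": "domain"} for p in domain]
             + [{**p, "source": "general"} for p in sampled])
    rng.shuffle(mixed)  # one shuffle so batches interleave both sources
    return mixed
```

With 700 domain prompts and ratio 0.3, `n_general` comes out to 300, matching the 700 + 300 = 1000 target above.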

Step 3.2: Configuration Changes for v3

# === GRPO v3 Config Changes ===

# FIX 1: Temperature β€” prevent entropy collapse
TEMPERATURE = 1.0  # Was 0.8 in v2. All published GRPO papers use 1.0.
# Reference: Skywork-OR1 (2505.22312) ablation shows τ=1.0 >> τ=0.6

# FIX 2: Completion length β€” remove the ceiling
MAX_COMPLETION_LENGTH = 4096  # Was 2048. Every v2 completion hit the ceiling.
# Trade-off: halve num_generations to fit VRAM

# FIX 3: Reduce generations to fit longer completions
NUM_GENERATIONS = 4  # Was 8. 4 × 4096 ≈ 8 × 2048 in VRAM terms.
# MC-GRPO paper shows G=4 can work if using median baseline

# FIX 4: Explicit β=0 (no KL penalty)
# Dr. GRPO paper: β=0 is optimal for rule-based rewards
BETA = 0.0

# FIX 5: Learning rate β€” slightly more aggressive
LEARNING_RATE = 3e-6  # Was 2e-6. Clip ratios were all 0 → room to push harder.
# Still well within published range (1e-6 to 5e-6)

# FIX 6: Max steps for expanded data
# 1000 prompts × 4 generations / (4 batch × 2 accum) = 500 steps/epoch
MAX_STEPS = 500

# FIX 7: Early stopping β€” more generous
EARLY_STOPPING_PATIENCE = 15  # 15 evals × 10 steps = 150 steps of runway
EVAL_STEPS = 10
SAVE_STEPS = 10
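For reference, these constants might map onto TRL's GRPOConfig as below. Field names follow TRL 0.24-era documentation, so treat this as a sketch and double-check each field against the installed version:

```python
# Sketch: wiring the v3 fixes into TRL's GRPOConfig (verify field names).
from trl import GRPOConfig

config_v3 = GRPOConfig(
    output_dir="outputs/grpo-v3",
    temperature=1.0,               # FIX 1
    max_completion_length=4096,    # FIX 2
    num_generations=4,             # FIX 3
    beta=0.0,                      # FIX 4: no KL penalty
    learning_rate=3e-6,            # FIX 5
    max_steps=500,                 # FIX 6
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    eval_strategy="steps",
    eval_steps=10,                 # FIX 7: evaluate/checkpoint every 10 steps
    save_steps=10,
    scale_rewards=False,           # Dr. GRPO fix (see Appendix)
)
```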

Step 3.3: Add Entropy Monitoring (Critical for v3)

Since TRL 0.24.0 doesn't have native entropy bonus, implement via callback:

import wandb
from transformers import TrainerCallback

class EntropyMonitorCallback(TrainerCallback):
    """Monitor policy entropy to detect collapse early."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "train/completion_length" in logs:
            # Proxy for entropy: if all completions are max length,
            # entropy is collapsing (model is stuck in one mode)
            completion_ratio = logs["train/completion_length"] / MAX_COMPLETION_LENGTH

            if completion_ratio > 0.95:
                print(f"⚠️ Step {state.global_step}: completion ratio {completion_ratio:.2f}; "
                      f"possible entropy collapse. Monitor reward_std.")

            # Log to W&B
            if wandb.run:
                wandb.log({
                    "monitor/completion_ratio": completion_ratio,
                    "monitor/entropy_proxy": 1.0 - completion_ratio,
                }, step=state.global_step)

Step 3.4: Add Zero-Advantage Group Filtering

# In the reward function wrapper or custom trainer:
import random

def filtered_commerce_reward_fn(completions, prompts, **kwargs):
    """Compute rewards and flag zero-variance groups for filtering."""
    rewards = list(commerce_reward_fn(completions, prompts, **kwargs))
    
    # Group rewards by prompt (each prompt has NUM_GENERATIONS completions)
    for i in range(0, len(rewards), NUM_GENERATIONS):
        group = rewards[i:i+NUM_GENERATIONS]
        if max(group) - min(group) < 0.01:  # Near-zero variance
            # Add small noise to break the tie
            # This prevents the GRPO denominator from exploding
            for j in range(i, i+NUM_GENERATIONS):
                rewards[j] += random.gauss(0, 0.005)
    
    return rewards

Note: The proper fix is MC-GRPO's median baseline, but that requires modifying TRL internals. The noise injection is a pragmatic workaround for TRL 0.24.0.

Step 3.5: Reward Function Refinement

Split the composite reward into staged components (Reasoning-SQL paper finding):

def commerce_reward_fn_v3(completions, prompts, **kwargs):
    """Multi-component reward with staged convergence."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Stage 1: Format reward (converges first)
        r_format = score_format(completion)  # JSON valid? Think tags closed? Right structure?
        
        # Stage 2: Partial content reward
        r_partial = score_partial_content(completion, prompt)  # Some fields correct?
        
        # Stage 3: Full task reward  
        r_task = score_full_task(completion, prompt)  # All fields correct? SQL executes?
        
        # Weighted combination; format weight decreases over training
        reward = 0.2 * r_format + 0.3 * r_partial + 0.5 * r_task
        rewards.append(reward)
    
    return rewards
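The weight comment above implies a schedule rather than fixed constants. One way to sketch that (an assumed linear anneal; the endpoint values are illustrative, not from the source):

```python
# Hypothetical schedule: shift reward weight from format to task over training.
def reward_weights(step: int, total_steps: int = 500):
    """Return (w_format, w_partial, w_task), always summing to 1.0."""
    progress = min(step / total_steps, 1.0)
    w_format = 0.2 * (1.0 - progress)   # 0.2 -> 0.0: format is learned early
    w_task = 0.5 + 0.2 * progress       # 0.5 -> 0.7: task accuracy takes over
    w_partial = 1.0 - w_format - w_task # stays at 0.3
    return w_format, w_partial, w_task
```

At step 0 this reproduces the 0.2/0.3/0.5 split above; by the final step the format weight has annealed to zero.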

Step 3.6: VRAM Budget for v3

L4 GPU: 24GB total

Model (NF4):          ~3.5GB
KV Cache (4096 seq):  ~2.0GB
Activations:          ~4.0GB
Optimizer states:     ~3.0GB
Generations (4Γ—4096): ~8.0GB
─────────────────────────────
Estimated total:      ~20.5GB
Headroom:             ~3.5GB

✅ Should fit. If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first.

Step 3.7: Training Execution

# Pre-flight checklist:
# ✅ Benchmark built and baselines recorded (Phase 1)
# ✅ 1000+ prompts prepared and validated
# ✅ Config changes applied (temperature, completion length, generations, LR)
# ✅ Entropy monitor callback added
# ✅ Zero-advantage filtering active
# ✅ Reward function v3 with staged components
# ✅ FRESH=True (nuke old checkpoints)
# ✅ W&B run name: grpo-v3-l4-{timestamp}

# Expected runtime: 500 steps × ~5 min/step (longer completions) ≈ 42 hours
# Checkpoints every 10 steps
# Early stopping patience: 15 evals (150 steps)

Step 3.8: Post-Training Validation

  1. Run Phase 1 benchmark on GRPO v3 best checkpoint
  2. Compare against all baselines:
    • Qwen3-3.7B base → Tucano2-SFT → Tucano2-GRPO-v2 → Tucano2-GRPO-v3
  3. If v3 > v2 on benchmark: save as production model
  4. If v3 ≈ v2: entropy collapse not fully fixed; consider switching to MC-GRPO or upgrading GPU for longer completions
  5. If v3 < v2: rollback, investigate β€” likely reward function regression

Decision Criteria for Stopping

| Condition | Action |
|-----------|--------|
| v3 eval reward > 0.20 AND extraction score > 0.40 | Ship it: significantly better than v2 |
| v3 eval reward 0.15-0.20, improving trend | Run epoch 2 (extend MAX_STEPS to 1000) |
| v3 eval reward < v2 (0.125) | Stop, diagnose, review reward function and data |
| Entropy collapse again (clip_ratio=0 after step 50) | Add entropy bonus via custom loss (requires TRL fork) |
| OOM | Reduce MAX_COMPLETION_LENGTH to 3072 → 2560 → 2048 |

Appendix: Literature References for Each Fix

| Fix | Paper | Section | Key Finding |
|-----|-------|---------|-------------|
| Temperature=1.0 | Skywork-OR1 (2505.22312) | §4, Table 3 | τ=1.0 gives 5-8% better performance, delays entropy collapse |
| β=0 (no KL) | Dr. GRPO (2503.20783) | §3.2 | KL penalty unnecessary for rule-based rewards |
| scale_rewards=False | Dr. GRPO (2503.20783) | §3.1 | Std normalization biases toward low-variance groups |
| Longer completions | Dr. GRPO (2503.20783) | §3.1 | GRPO length bias inflates wrong answers → ceiling hit |
| Zero-advantage filtering | Skywork-OR1 (2505.22312) | §3.1 | Zero-std groups destabilize training |
| Staged rewards | Reasoning-SQL (2503.23157) | §3.2 | Format rewards converge first, enable task learning |
| General data mixing | Cocktail Effect (2410.01109) | §4 | 30% general data improves domain performance 2-15% |
| G=4 with median | MC-GRPO (2601.22582) | §3 | Median baseline reduces noise at small group sizes |