rtferraz committed · verified
Commit b47b36b · 1 Parent(s): aa71b0c

docs: add ADR-001 next steps with detailed execution plans

Files changed (1):
1. docs/ADR-001-next-steps.md (+517, -0)

docs/ADR-001-next-steps.md ADDED
@@ -0,0 +1,517 @@
# ADR-001: Tucano2-Commerce Next Steps – Detailed Execution Plans

**Status:** Accepted
**Date:** 2025-04-23
**Author:** Rafael Ferraz
**Context:** GRPO v2 training completed (210/300 steps, early stopped) with a mean validation reward of 0.54 (+42% vs. the SFT baseline). Three critical issues were diagnosed: entropy collapse, a completion-length ceiling, and insufficient data scale. This ADR details the execution plan for the next phase.

---

## Phase 1: Build the Domain Benchmark

**Goal:** Create a rigorous, reproducible evaluation suite that measures Tucano2 across all 5 task types.
**Time estimate:** 1-2 days
**Prerequisite:** None – can start immediately

### Step 1.1: Design the Benchmark Prompts

Create **80 held-out prompts** (never seen in training), stratified by task:

| Task | Count | Source | Difficulty Mix |
|------|-------|--------|----------------|
| Structured Extraction | 20 | Real reviews from Olist/B2W datasets | 10 easy (clear sentiment) + 10 hard (mixed/ambiguous) |
| Sentiment Analysis | 15 | Real reviews, balanced pos/neg/neutral | 5 per polarity |
| SQL Generation | 15 | Business questions against your e-commerce schema | 5 simple SELECT + 5 JOIN + 5 aggregate/window |
| Churn/Risk Prediction | 15 | Customer profiles with known outcomes | 5 low-risk + 5 medium + 5 high-risk |
| Business Insights | 15 | Open-ended analytical questions | 5 comparison + 5 trend + 5 recommendation |

**Implementation:**

```python
# File: benchmark/prompts.jsonl
# Each line is a JSON object:
{
  "id": "ext-001",
  "task": "extraction",
  "difficulty": "hard",
  "prompt": "Analise esta avaliação...",
  "system": "<your system prompt>",
  "ground_truth": {
    "sentiment": "negativo",
    "sentiment_score": -0.6,
    "churn_risk": 0.8,
    ...
  },
  "notes": "Mixed sentiment: product good but delivery terrible"
}
```

**Key principles:**
- Include edge cases: mixed sentiment, sarcasm, regional slang ("barato saiu caro"), incomplete reviews
- SQL prompts must have executable ground-truth queries against your actual schema (see the validation sketch below)
- Insights prompts should require multi-step reasoning, not one-hop lookups
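
A minimal pre-flight check for the second principle, assuming the benchmark schema lives in SQLite and each SQL prompt stores its ground-truth query under `ground_truth["sql"]` (both are assumptions, not fixed by this ADR):

```python
import json
import sqlite3

def validate_sql_ground_truth(prompts_path: str, db_path: str) -> list:
    """Return ids of SQL prompts whose ground-truth query fails to execute."""
    conn = sqlite3.connect(db_path)
    failures = []
    for line in open(prompts_path):
        prompt = json.loads(line)
        if prompt["task"] != "sql":
            continue
        try:
            conn.execute(prompt["ground_truth"]["sql"]).fetchall()
        except sqlite3.Error:
            failures.append(prompt["id"])
    conn.close()
    return failures
```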

### Step 1.2: Build Automated Scoring Functions

Each task type gets its own scorer:

#### Extraction Scorer
```python
from sentence_transformers import SentenceTransformer, util

# Any PT-BR-capable embedding model works here; this one is a placeholder choice.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def score_extraction(pred: dict, gt: dict) -> dict:
    """Score each of the 10 JSON fields independently."""
    scores = {}

    # Categorical fields: exact match
    for field in ["sentiment", "complaint_category"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Numeric fields: distance-based
    for field in ["sentiment_score", "churn_risk", "repeat_intent"]:
        scores[field] = max(0.0, 1.0 - abs(pred[field] - gt[field]))

    # Boolean fields: exact match
    for field in ["delivery_issue", "product_issue", "seller_issue", "would_recommend"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Text fields: semantic similarity (sentence-transformers)
    for field in ["main_complaint"]:
        emb_pred = _embedder.encode(pred[field], convert_to_tensor=True)
        emb_gt = _embedder.encode(gt[field], convert_to_tensor=True)
        scores[field] = float(util.cos_sim(emb_pred, emb_gt))

    scores["mean"] = sum(scores.values()) / len(scores)
    return scores
```

#### SQL Scorer
```python
def score_sql(predicted_sql: str, ground_truth_sql: str, db_connection) -> dict:
    """Execute both queries and compare result sets."""
    try:
        pred_results = db_connection.execute(predicted_sql).fetchall()
        gt_results = db_connection.execute(ground_truth_sql).fetchall()

        # Execution accuracy (EX): do the result sets match?
        ex = 1.0 if set(pred_results) == set(gt_results) else 0.0

        # Partial credit: row overlap
        overlap = len(set(pred_results) & set(gt_results))
        partial = overlap / max(len(gt_results), 1)

        return {"ex": ex, "partial": partial, "syntax_valid": 1.0}
    except Exception:
        return {"ex": 0.0, "partial": 0.0, "syntax_valid": 0.0}
```

#### Sentiment Scorer
```python
def score_sentiment(pred: dict, gt: dict) -> dict:
    """Exact match on polarity, distance on score (scores live in [-1, 1])."""
    polarity_match = 1.0 if pred["polarity"] == gt["polarity"] else 0.0
    score_distance = max(0.0, 1.0 - abs(pred["score"] - gt["score"]) / 2.0)
    return {"polarity": polarity_match, "score": score_distance}
```

#### Churn Scorer
```python
def score_churn(prediction: float, ground_truth: float, threshold: float = 0.5) -> dict:
    """Binary accuracy + calibration."""
    binary = 1.0 if (prediction >= threshold) == (ground_truth >= threshold) else 0.0
    calibration = max(0.0, 1.0 - abs(prediction - ground_truth))
    return {"binary_accuracy": binary, "calibration": calibration}
```

#### Insights Scorer (LLM-as-Judge)
```python
import json
import openai

JUDGE_PROMPT = """You are evaluating a Brazilian e-commerce analysis response.
Rate on 4 dimensions (1-5 each):
1. Relevance: Does it address the question?
2. Accuracy: Are the claims factually reasonable?
3. Actionability: Could a business act on this analysis?
4. Portuguese Quality: Is the PT-BR natural and professional?

Response to evaluate:
{response}

Question asked:
{question}

Return JSON: {{"relevance": X, "accuracy": X, "actionability": X, "portuguese": X}}"""
# Note: the literal braces in the JSON template are doubled so str.format()
# doesn't treat them as placeholders.

def score_insights(response: str, question: str) -> dict:
    """Use GPT-4o as judge, run 3x and average."""
    scores = []
    for _ in range(3):
        result = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                response=response, question=question
            )}],
            temperature=0.3,
        )
        scores.append(json.loads(result.choices[0].message.content))

    # Average across 3 runs
    return {k: sum(s[k] for s in scores) / 3 for k in scores[0]}
```
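
If the judge intermittently returns non-JSON, one option is to request JSON mode on the chat completions call (`response_format={"type": "json_object"}`), which makes the `json.loads` step more reliable; JSON mode requires the word "JSON" to appear in the prompt, which `JUDGE_PROMPT` already satisfies.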

### Step 1.3: Run the Benchmark Script

```python
# File: benchmark/run_benchmark.py
import json

def mean_of(scores: dict) -> float:
    """Scorers return different keys; use their own "mean" if present,
    otherwise average all values."""
    return scores.get("mean", sum(scores.values()) / len(scores))

def run_benchmark(model, tokenizer, prompts_path, output_path):
    prompts = [json.loads(line) for line in open(prompts_path)]
    results = []

    for prompt in prompts:
        # Generate response (generate() and parse_response() are shared project helpers)
        messages = [
            {"role": "system", "content": prompt["system"]},
            {"role": "user", "content": prompt["prompt"]},
        ]
        response = generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1)

        # Score based on task type. Note: the SQL scorer also needs a db_connection
        # and the insights scorer needs the question, so dispatch accordingly.
        scorer = SCORERS[prompt["task"]]
        score = scorer(parse_response(response), prompt.get("ground_truth"))

        results.append({
            "id": prompt["id"],
            "task": prompt["task"],
            "difficulty": prompt["difficulty"],
            "response": response,
            "scores": score,
            "tokens": len(tokenizer.encode(response)),
        })

    # Save results
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Print summary
    for task in ["extraction", "sentiment", "sql", "churn", "insights"]:
        task_results = [r for r in results if r["task"] == task]
        mean_score = sum(mean_of(r["scores"]) for r in task_results) / len(task_results)
        print(f"{task}: {mean_score:.3f} (n={len(task_results)})")

SCORERS = {
    "extraction": score_extraction,
    "sentiment": score_sentiment,
    "sql": score_sql,
    "churn": score_churn,
    "insights": score_insights,
}
```

### Step 1.4: Establish Baselines

Run the benchmark on:
1. **Qwen3-3.7B base** (zero-shot, no adapters) – this is your floor
2. **Tucano2-SFT** (SFT adapter only) – this is your pre-GRPO baseline
3. **Tucano2-GRPO v2** (current best) – this is what you're improving

Record all results in a comparison table.
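
A driver sketch for the three runs; `load_model()` and the adapter paths are placeholders for however the checkpoints are actually stored:

```python
CHECKPOINTS = {
    "qwen3-base": None,                      # no adapter
    "tucano2-sft": "adapters/sft",           # illustrative paths
    "tucano2-grpo-v2": "adapters/grpo-v2",
}

for name, adapter in CHECKPOINTS.items():
    model, tokenizer = load_model(adapter)   # hypothetical loader
    run_benchmark(model, tokenizer, "benchmark/prompts.jsonl", f"results/{name}.json")
```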

---

## Phase 2: Run Comparison vs. Qwen3-35B-A3B

**Goal:** Prove domain-tuned 3.7B matches or beats general 35B on e-commerce tasks.
**Time estimate:** 1 day (after benchmark is built)
**Prerequisite:** Phase 1 complete

### Step 2.1: Set Up Qwen3-35B-A3B

This is a Mixture-of-Experts model (35B total, ~3B active per token). It should fit on your L4 with 4-bit quantization.

```python
from unsloth import FastLanguageModel

model_35b, tokenizer_35b = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-35B-A3B",  # Verify exact HF repo name
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
FastLanguageModel.for_inference(model_35b)
```

**Memory estimate:** 35B params × 0.5 bytes (4-bit) ≈ 17.5GB. Active inference ~3B ≈ 3GB compute. Should fit on the L4 (24GB) with headroom.

**If it doesn't fit:** Use `transformers` with `BitsAndBytesConfig(load_in_4bit=True)` instead of Unsloth (sketched below), or try GGUF via `llama.cpp`.
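
A minimal version of that fallback, assuming plain `transformers` + `bitsandbytes` with NF4 quantization (the repo name still needs the verification flagged above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_35b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-35B-A3B",  # verify exact HF repo name
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer_35b = AutoTokenizer.from_pretrained("Qwen/Qwen3-35B-A3B")
```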

### Step 2.2: Run the Same Benchmark

```python
# Run identical benchmark on Qwen3-35B-A3B
run_benchmark(model_35b, tokenizer_35b, "benchmark/prompts.jsonl", "results/qwen3-35b.json")
```

**Critical:** Use the same system prompt, same temperature (0.1 for eval), same max_new_tokens. The only variable should be the model.
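
One way to enforce that is a single decoding config reused for every model; `GEN_KWARGS` is our name for it, and `generate()` is the shared helper assumed above:

```python
# Single source of truth for eval-time decoding; every model gets the same values.
GEN_KWARGS = {"max_new_tokens": 2048, "temperature": 0.1}

response = generate(model_35b, tokenizer_35b, messages, **GEN_KWARGS)
```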

### Step 2.3: Optional – Add GPT-4o Baseline

```python
# For the strongest reference point
import json
import openai

def run_benchmark_api(prompts_path, output_path, model="gpt-4o"):
    prompts = [json.loads(line) for line in open(prompts_path)]
    results = []

    for prompt in prompts:
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt["system"]},
                {"role": "user", "content": prompt["prompt"]},
            ],
            temperature=0.1,
        )
        # ... score same as above
```

### Step 2.4: Build the Comparison Report

| Task | Qwen3-3.7B (base) | Tucano2-SFT | Tucano2-GRPO v2 | Qwen3-35B (zero-shot) | GPT-4o (zero-shot) |
|------|-------------------|-------------|-----------------|-----------------------|--------------------|
| Extraction | | | | | |
| Sentiment | | | | | |
| SQL | | | | | |
| Churn | | | | | |
| Insights | | | | | |
| **MEAN** | | | | | |
| Cost/query | $0.001 | $0.001 | $0.001 | $0.003 | $0.010 |
| Latency (s) | | | | | |
| Privacy | ✅ On-prem | ✅ On-prem | ✅ On-prem | ✅ On-prem | ❌ API |

**The story to tell:**
- If Tucano2-GRPO ≥ Qwen3-35B on extraction/sentiment/SQL: "Domain tuning eliminates the need for 10× larger models"
- If Tucano2-GRPO < Qwen3-35B on insights: "Open-ended reasoning still benefits from scale, but structured tasks don't"
- The cost row proves the business case regardless of performance parity

---

## Phase 3: GRPO v3 Training Run

**Goal:** Fix entropy collapse, break through the v2 performance plateau.
**Time estimate:** 2-3 days (including data prep)
**Prerequisite:** Phase 1 (benchmark needed to measure improvement)

### Step 3.1: Expand Training Data (1000+ prompts)

**Target: 1000 prompts** (up from 300), stratified:

| Task | Current | Target | How to Expand |
|------|---------|--------|---------------|
| Extraction | ~75 | 300 | Generate from the Olist dataset: sample reviews, create ground-truth JSON |
| Sentiment | ~60 | 200 | Sample from the B2W reviews corpus, label with existing model + human review |
| SQL | ~75 | 250 | Template-based: vary table names, WHERE clauses, aggregation patterns |
| Churn | ~45 | 100 | Augment customer profiles with synthetic variations |
| Insights | ~45 | 150 | LLM-generated analytical questions about e-commerce scenarios |

**Synthetic generation recipe (from the Cocktail Effect paper):**
```python
# Use your SFT model or GPT-4o to generate new training prompts,
# then manually verify/correct the ground truth.
# format_extraction_prompt() and gpt4o_extract() are project helpers.
def generate_extraction_prompts(reviews_df, n=200):
    prompts = []
    for _, row in reviews_df.sample(n).iterrows():
        prompt = format_extraction_prompt(row["review_text"], row["score"], row["status"])
        # Generate ground truth with GPT-4o (higher quality than self-labeling)
        ground_truth = gpt4o_extract(prompt)
        prompts.append({"prompt": prompt, "ground_truth": ground_truth})
    return prompts
```

**Also add 30% general reasoning data** (Cocktail Effect paper finding):
- Source: Portuguese subset of Orca-Math or translated OpenOrca
- Purpose: Regularization; prevents the model from overfitting to domain patterns
- Mix ratio: 700 domain + 300 general = 1000 total (see the mixing sketch below)
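
A minimal mixing sketch; the file names are illustrative:

```python
import json
import random

domain = [json.loads(l) for l in open("data/domain_prompts.jsonl")]
general = [json.loads(l) for l in open("data/general_reasoning_pt.jsonl")]

random.seed(42)
mixed = random.sample(domain, 700) + random.sample(general, 300)
random.shuffle(mixed)

with open("data/grpo_v3_train.jsonl", "w") as f:
    for row in mixed:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```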

### Step 3.2: Configuration Changes for v3

```python
# === GRPO v3 Config Changes ===

# FIX 1: Temperature - prevent entropy collapse
TEMPERATURE = 1.0  # Was 0.8 in v2. The GRPO papers cited in the appendix use 1.0.
# Reference: Skywork-OR1 (2505.22312) ablation shows τ=1.0 >> τ=0.6

# FIX 2: Completion length - remove the ceiling
MAX_COMPLETION_LENGTH = 4096  # Was 2048. Every v2 completion hit the ceiling.
# Trade-off: halve num_generations to fit VRAM

# FIX 3: Reduce generations to fit longer completions
NUM_GENERATIONS = 4  # Was 8. 4 × 4096 ≈ 8 × 2048 in VRAM terms.
# MC-GRPO paper shows G=4 can work if using median baseline

# FIX 4: Explicit β=0 (no KL penalty)
# Dr. GRPO paper: β=0 is optimal for rule-based rewards
BETA = 0.0

# FIX 5: Learning rate - slightly more aggressive
LEARNING_RATE = 3e-6  # Was 2e-6. Clip ratios were all 0 → room to push harder.
# Still well within published range (1e-6 to 5e-6)

# FIX 6: Max steps for expanded data
# 1000 prompts × 4 generations / (4 batch × 2 accum) = 500 steps/epoch
MAX_STEPS = 500

# FIX 7: Early stopping - more generous
EARLY_STOPPING_PATIENCE = 15  # 15 evals × 10 steps = 150 steps of runway
EVAL_STEPS = 10
SAVE_STEPS = 10
```
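
For orientation, a sketch of how these constants would plug into TRL's `GRPOConfig`; the field names are believed current for TRL 0.24.0 but should be double-checked against the installed version:

```python
from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="checkpoints/grpo-v3",          # illustrative path
    learning_rate=LEARNING_RATE,
    temperature=TEMPERATURE,
    max_completion_length=MAX_COMPLETION_LENGTH,
    num_generations=NUM_GENERATIONS,
    beta=BETA,
    scale_rewards=False,                        # per Dr. GRPO (see appendix)
    max_steps=MAX_STEPS,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    eval_strategy="steps",
    eval_steps=EVAL_STEPS,
    save_steps=SAVE_STEPS,
)
```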

### Step 3.3: Add Entropy Monitoring (Critical for v3)

Since TRL 0.24.0 has no native entropy bonus, at minimum monitor a collapse proxy via a callback:

```python
import wandb
from transformers import TrainerCallback

class EntropyMonitorCallback(TrainerCallback):
    """Monitor policy entropy to detect collapse early."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "train/completion_length" in logs:
            # Proxy for entropy: if all completions are max length,
            # entropy is collapsing (model is stuck in one mode)
            completion_ratio = logs["train/completion_length"] / MAX_COMPLETION_LENGTH

            if completion_ratio > 0.95:
                print(f"⚠️ Step {state.global_step}: Completion ratio {completion_ratio:.2f}; "
                      f"possible entropy collapse. Monitor reward_std.")

            # Log to W&B
            if wandb.run:
                wandb.log({
                    "monitor/completion_ratio": completion_ratio,
                    "monitor/entropy_proxy": 1.0 - completion_ratio,
                }, step=state.global_step)
```
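
Register it like any Transformers callback: either via the trainer constructor (`callbacks=[EntropyMonitorCallback()]`) or afterwards with `trainer.add_callback(EntropyMonitorCallback())`.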

### Step 3.4: Add Zero-Advantage Group Filtering

```python
import random

# In the reward function wrapper or custom trainer:
def filtered_commerce_reward_fn(completions, prompts, **kwargs):
    """Compute rewards and flag zero-variance groups for filtering."""
    rewards = commerce_reward_fn(completions, prompts, **kwargs)

    # Group rewards by prompt (each prompt has NUM_GENERATIONS completions)
    for i in range(0, len(rewards), NUM_GENERATIONS):
        group = rewards[i:i + NUM_GENERATIONS]
        if max(group) - min(group) < 0.01:  # Near-zero variance
            # Add small noise to break the tie.
            # This prevents the GRPO denominator from exploding.
            for j in range(i, i + NUM_GENERATIONS):
                rewards[j] += random.gauss(0, 0.005)

    return rewards
```

**Note:** The proper fix is MC-GRPO's median baseline (illustrated below), but that requires modifying TRL internals. The noise injection is a pragmatic workaround for TRL 0.24.0.
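
For intuition only, a standalone sketch of what the median baseline changes; this is not wired into TRL, and the function name is ours:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, baseline: str = "mean") -> np.ndarray:
    """Group-relative advantages for one prompt's G completions.

    Standard GRPO subtracts the group mean; MC-GRPO's variant subtracts the
    median, which is more robust to a single outlier reward at small G.
    (No std division here, matching scale_rewards=False.)
    """
    center = np.median(rewards) if baseline == "median" else rewards.mean()
    return rewards - center

# One lucky completion in a G=4 group:
group = np.array([0.1, 0.1, 0.1, 0.9])
print(grpo_advantages(group, "mean"))    # [-0.2 -0.2 -0.2  0.6]
print(grpo_advantages(group, "median"))  # [ 0.   0.   0.   0.8]
```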

### Step 3.5: Reward Function Refinement

Split the composite reward into staged components (Reasoning-SQL paper finding):

```python
def commerce_reward_fn_v3(completions, prompts, **kwargs):
    """Multi-component reward with staged convergence."""
    # score_format / score_partial_content / score_full_task are project scorers.
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Stage 1: Format reward (converges first)
        r_format = score_format(completion)  # JSON valid? Think tags closed? Right structure?

        # Stage 2: Partial content reward
        r_partial = score_partial_content(completion, prompt)  # Some fields correct?

        # Stage 3: Full task reward
        r_task = score_full_task(completion, prompt)  # All fields correct? SQL executes?

        # Weighted combination. Weights are fixed here; annealing the format
        # weight down over training is sketched below.
        reward = 0.2 * r_format + 0.3 * r_partial + 0.5 * r_task
        rewards.append(reward)

    return rewards
```
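
A possible annealing schedule for those weights; the 0.4 → 0.1 range is illustrative, not tuned:

```python
def staged_weights(step: int, max_steps: int = 500) -> tuple:
    """Shift credit from format to full task as training progresses."""
    progress = min(step / max_steps, 1.0)
    w_format = 0.4 * (1.0 - progress) + 0.1 * progress   # 0.4 -> 0.1
    w_partial = 0.3                                       # held constant
    w_task = 1.0 - w_format - w_partial                   # 0.3 -> 0.6
    return w_format, w_partial, w_task
```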

### Step 3.6: VRAM Budget for v3

```
L4 GPU: 24GB total

Model (NF4):             ~3.5GB
KV cache (4096 seq):     ~2.0GB
Activations:             ~4.0GB
Optimizer states:        ~3.0GB
Generations (4 × 4096):  ~8.0GB
───────────────────────────────
Estimated total:        ~20.5GB
Headroom:                ~3.5GB

✅ Should fit. If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first.
```
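
A quick way to check the estimate against reality before committing to the full run, using PyTorch's standard CUDA memory counters:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a handful of training steps at the v3 settings (4 generations × 4096 tokens) ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB (estimate: 20.5 GB, hard limit: 24 GB)")
```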

### Step 3.7: Training Execution

```bash
# Pre-flight checklist:
# ✅ Benchmark built and baselines recorded (Phase 1)
# ✅ 1000+ prompts prepared and validated
# ✅ Config changes applied (temperature, completion length, generations, LR)
# ✅ Entropy monitor callback added
# ✅ Zero-advantage filtering active
# ✅ Reward function v3 with staged components
# ✅ FRESH=True (nuke old checkpoints)
# ✅ W&B run name: grpo-v3-l4-{timestamp}

# Expected runtime: 500 steps × ~5 min/step (longer completions) ≈ 42 hours
# Checkpoints every 10 steps
# Early stopping patience: 15 evals (150 steps)
```

### Step 3.8: Post-Training Validation

1. Run the Phase 1 benchmark on the best GRPO v3 checkpoint
2. Compare against all baselines:
   - Qwen3-3.7B base → Tucano2-SFT → Tucano2-GRPO-v2 → **Tucano2-GRPO-v3**
3. If v3 > v2 on the benchmark: save as the production model
4. If v3 ≈ v2: entropy collapse not fully fixed; consider switching to MC-GRPO or upgrading the GPU for longer completions
5. If v3 < v2: roll back and investigate; the likely culprit is a reward-function regression

---

## Decision Criteria for Stopping

| Condition | Action |
|-----------|--------|
| v3 eval reward > 0.20 AND extraction score > 0.40 | Ship it; significantly better than v2 |
| v3 eval reward 0.15-0.20, improving trend | Run epoch 2 (extend MAX_STEPS to 1000) |
| v3 eval reward < v2 (0.125) | Stop, diagnose, review reward function and data |
| Entropy collapse again (clip_ratio=0 after step 50) | Add entropy bonus via custom loss (requires TRL fork) |
| OOM | Reduce MAX_COMPLETION_LENGTH to 3072 → 2560 → 2048 |

---

## Appendix: Literature References for Each Fix

| Fix | Paper | Section | Key Finding |
|-----|-------|---------|-------------|
| Temperature=1.0 | Skywork-OR1 (2505.22312) | §4, Table 3 | τ=1.0 gives 5-8% better performance, delays entropy collapse |
| β=0 (no KL) | Dr. GRPO (2503.20783) | §3.2 | KL penalty unnecessary for rule-based rewards |
| scale_rewards=False | Dr. GRPO (2503.20783) | §3.1 | Std normalization biases toward low-variance groups |
| Longer completions | Dr. GRPO (2503.20783) | §3.1 | GRPO length bias inflates wrong answers → ceiling hit |
| Zero-advantage filtering | Skywork-OR1 (2505.22312) | §3.1 | Zero-std groups destabilize training |
| Staged rewards | Reasoning-SQL (2503.23157) | §3.2 | Format rewards converge first, enable task learning |
| General data mixing | Cocktail Effect (2410.01109) | §4 | 30% general data improves domain performance 2-15% |
| G=4 with median | MC-GRPO (2601.22582) | §3 | Median baseline reduces noise at small group sizes |