# ADR-001: Tucano2-Commerce Next Steps β€” Detailed Execution Plans
**Status:** Accepted
**Date:** 2025-04-23
**Author:** Rafael Ferraz
**Context:** GRPO v2 training completed (210/300 steps, early stopped). Mean validation reward 0.54 (+42% vs SFT baseline). Three critical issues diagnosed: entropy collapse, completion length ceiling, data scale. This ADR details the execution plan for the next phase.
---
## Phase 1: Build the Domain Benchmark
**Goal:** Create a rigorous, reproducible evaluation suite that measures Tucano2 across all 5 task types.
**Time estimate:** 1-2 days
**Prerequisite:** None β€” can start immediately
### Step 1.1: Design the Benchmark Prompts
Create **80 held-out prompts** (never seen in training), stratified by task:
| Task | Count | Source | Difficulty Mix |
|------|-------|--------|---------------|
| Structured Extraction | 20 | Real reviews from Olist/B2W datasets | 10 easy (clear sentiment) + 10 hard (mixed/ambiguous) |
| Sentiment Analysis | 15 | Real reviews, balanced pos/neg/neutral | 5 per polarity |
| SQL Generation | 15 | Business questions against your e-commerce schema | 5 simple SELECT + 5 JOIN + 5 aggregate/window |
| Churn/Risk Prediction | 15 | Customer profiles with known outcomes | 5 low-risk + 5 medium + 5 high-risk |
| Business Insights | 15 | Open-ended analytical questions | 5 comparison + 5 trend + 5 recommendation |
**Implementation:**
```python
# File: benchmark/prompts.jsonl
# Each line is one JSON object (pretty-printed here for readability):
{
  "id": "ext-001",
  "task": "extraction",
  "difficulty": "hard",
  "prompt": "Analise esta avaliação...",
  "system": "<your system prompt>",
  "ground_truth": {
    "sentiment": "negativo",
    "sentiment_score": -0.6,
    "churn_risk": 0.8,
    ...
  },
  "notes": "Mixed sentiment — product good but delivery terrible"
}
```
**Key principles:**
- Include edge cases: mixed sentiment, sarcasm, regional slang ("barato saiu caro", i.e. "the cheap option turned out expensive"), incomplete reviews
- SQL prompts must have executable ground truth queries against your actual schema
- Insights prompts should require multi-step reasoning, not one-hop lookups
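Before wiring up scorers, a quick sanity pass over `prompts.jsonl` catches schema drift early. A minimal sketch, assuming the field names from the example above:

```python
import json

# ground_truth is not required here: insights prompts are LLM-judged
# and may omit it
REQUIRED = {"id", "task", "difficulty", "prompt", "system"}
TASKS = {"extraction", "sentiment", "sql", "churn", "insights"}

with open("benchmark/prompts.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        rec = json.loads(line)
        missing = REQUIRED - rec.keys()
        assert not missing, f"line {i}: missing keys {missing}"
        assert rec["task"] in TASKS, f"line {i}: unknown task {rec['task']!r}"
```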
### Step 1.2: Build Automated Scoring Functions
Each task type gets its own scorer:
#### Extraction Scorer
```python
def score_extraction(pred: dict, gt: dict) -> dict:
    """Score each of the 10 JSON fields independently."""
    scores = {}
    # Categorical fields: exact match
    for field in ["sentiment", "complaint_category"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0
    # Numeric fields: distance-based
    for field in ["sentiment_score", "churn_risk", "repeat_intent"]:
        scores[field] = max(0.0, 1.0 - abs(pred[field] - gt[field]))
    # Boolean fields: exact match
    for field in ["delivery_issue", "product_issue", "seller_issue", "would_recommend"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0
    # Text fields: semantic similarity (sentence-transformers; helpers below)
    for field in ["main_complaint"]:
        scores[field] = cosine_similarity(embed(pred[field]), embed(gt[field]))
    scores["mean"] = sum(scores.values()) / len(scores)
    return scores
```
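The `embed` / `cosine_similarity` helpers above are assumed. A minimal sketch with sentence-transformers; the checkpoint name is an illustrative choice, any multilingual model that handles PT-BR works:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed helpers; the checkpoint below is an example, not a requirement
_embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def embed(text: str):
    return _embedder.encode(text, convert_to_tensor=True)

def cosine_similarity(a, b) -> float:
    # util.cos_sim returns a 1x1 tensor for a pair of single vectors
    return float(util.cos_sim(a, b).item())
```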
#### SQL Scorer
```python
def score_sql(predicted_sql: str, ground_truth_sql: str, db_connection) -> dict:
    """Execute both queries and compare result sets."""
    try:
        pred_results = db_connection.execute(predicted_sql).fetchall()
        gt_results = db_connection.execute(ground_truth_sql).fetchall()
        # Execution accuracy (EX): do the result sets match?
        ex = 1.0 if set(pred_results) == set(gt_results) else 0.0
        # Partial credit: row overlap
        overlap = len(set(pred_results) & set(gt_results))
        partial = overlap / max(len(gt_results), 1)
        return {"ex": ex, "partial": partial, "syntax_valid": 1.0}
    except Exception:
        return {"ex": 0.0, "partial": 0.0, "syntax_valid": 0.0}
```
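A usage sketch against an in-memory SQLite copy of the schema; the table and queries here are illustrative placeholders, not the actual e-commerce schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

print(score_sql(
    "SELECT id FROM orders WHERE total > 5",    # predicted
    "SELECT id FROM orders WHERE total > 5.0",  # ground truth
    conn,
))  # -> {"ex": 1.0, "partial": 1.0, "syntax_valid": 1.0}
```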
#### Sentiment Scorer
```python
def score_sentiment(pred: dict, gt: dict) -> dict:
    """Exact match on polarity, distance on score (scores live in [-1, 1])."""
    polarity_match = 1.0 if pred["polarity"] == gt["polarity"] else 0.0
    score_distance = max(0.0, 1.0 - abs(pred["score"] - gt["score"]) / 2.0)
    return {"polarity": polarity_match, "score": score_distance}
```
#### Churn Scorer
```python
def score_churn(prediction: float, ground_truth: float, threshold=0.5) -> dict:
    """Binary accuracy + calibration."""
    binary = 1.0 if (prediction >= threshold) == (ground_truth >= threshold) else 0.0
    calibration = max(0.0, 1.0 - abs(prediction - ground_truth))
    return {"binary_accuracy": binary, "calibration": calibration}
```
#### Insights Scorer (LLM-as-Judge)
```python
import json
import openai

JUDGE_PROMPT = """You are evaluating a Brazilian e-commerce analysis response.
Rate on 4 dimensions (1-5 each):
1. Relevance: Does it address the question?
2. Accuracy: Are the claims factually reasonable?
3. Actionability: Could a business act on this analysis?
4. Portuguese Quality: Is the PT-BR natural and professional?

Response to evaluate:
{response}

Question asked:
{question}

Return JSON: {{"relevance": X, "accuracy": X, "actionability": X, "portuguese": X}}"""
# Note the doubled braces: the literal JSON example must be escaped
# so str.format only fills {response} and {question}.

def score_insights(response: str, question: str) -> dict:
    """Use GPT-4o as judge, run 3x and average."""
    scores = []
    for _ in range(3):
        result = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                response=response, question=question
            )}],
            temperature=0.3,
            response_format={"type": "json_object"},  # force parseable JSON
        )
        scores.append(json.loads(result.choices[0].message.content))
    # Average each dimension across the 3 runs
    return {k: sum(s[k] for s in scores) / 3 for k in scores[0]}
```
### Step 1.3: Run the Benchmark Script
```python
# File: benchmark/run_benchmark.py
import json

SCORERS = {
    "extraction": score_extraction,
    "sentiment": score_sentiment,
    "sql": score_sql,
    "churn": score_churn,
    "insights": score_insights,
}

def mean_of(scores: dict) -> float:
    """Not every scorer returns a 'mean' key, so fall back to averaging."""
    return scores.get("mean", sum(scores.values()) / len(scores))

def run_benchmark(model, tokenizer, prompts_path, output_path):
    with open(prompts_path, encoding="utf-8") as f:
        prompts = [json.loads(line) for line in f]
    results = []
    for prompt in prompts:
        # Generate response (a sketch of generate() follows this block)
        messages = [
            {"role": "system", "content": prompt["system"]},
            {"role": "user", "content": prompt["prompt"]},
        ]
        response = generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1)
        # Score based on task type. parse_response() extracts the JSON/SQL
        # payload from the raw completion. Note: score_sql and score_insights
        # take extra arguments (db connection, question), so each needs a thin
        # adapter to fit this uniform call.
        scorer = SCORERS[prompt["task"]]
        score = scorer(parse_response(response), prompt.get("ground_truth"))
        results.append({
            "id": prompt["id"],
            "task": prompt["task"],
            "difficulty": prompt["difficulty"],
            "response": response,
            "scores": score,
            "tokens": len(tokenizer.encode(response)),
        })
    # Save results
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    # Print per-task summary
    for task in ["extraction", "sentiment", "sql", "churn", "insights"]:
        task_results = [r for r in results if r["task"] == task]
        task_mean = sum(mean_of(r["scores"]) for r in task_results) / len(task_results)
        print(f"{task}: {task_mean:.3f} (n={len(task_results)})")
```
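The `generate()` helper used above is assumed. A minimal sketch via the standard HF chat-template path; adapt for Unsloth's fast-inference mode as needed:

```python
import torch

def generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1):
    # Apply the model's chat template and move inputs to the model's device
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(
            inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
        )
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
```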
### Step 1.4: Establish Baselines
Run the benchmark on:
1. **Qwen3-3.7B base** (zero-shot, no adapters) β€” this is your floor
2. **Tucano2-SFT** (SFT adapter only) β€” this is your pre-GRPO baseline
3. **Tucano2-GRPO v2** (current best) β€” this is what you're improving
Record all results in a comparison table.
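A sketch of the baseline sweep; `load_model()` is an assumed wrapper that returns `(model, tokenizer)` with the given adapter applied (`None` = base model), and the checkpoint paths are placeholders:

```python
BASELINES = {
    "qwen3-3.7b-base": None,                       # floor: no adapter
    "tucano2-sft": "checkpoints/sft-best",         # hypothetical path
    "tucano2-grpo-v2": "checkpoints/grpo-v2-best", # hypothetical path
}

for name, adapter_path in BASELINES.items():
    model, tokenizer = load_model(adapter_path)
    run_benchmark(model, tokenizer, "benchmark/prompts.jsonl", f"results/{name}.json")
```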
---
## Phase 2: Run Comparison vs. Qwen3-35B-A3B
**Goal:** Prove domain-tuned 3.7B matches or beats general 35B on e-commerce tasks.
**Time estimate:** 1 day (after benchmark is built)
**Prerequisite:** Phase 1 complete
### Step 2.1: Set Up Qwen3-35B-A3B
This is a Mixture-of-Experts model (35B total, ~3B active per token). It should fit on your L4 with 4-bit quantization.
```python
from unsloth import FastLanguageModel

model_35b, tokenizer_35b = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-35B-A3B",  # Verify exact HF repo name
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
FastLanguageModel.for_inference(model_35b)
```
**Memory estimate:** 35B params × 0.5 bytes (4-bit) ≈ 17.5GB of weights. Only ~3B params are active per token, so per-token compute is close to a 3B dense model. Should fit on the L4 (24GB) with some headroom.
**If it doesn't fit:** Use `transformers` with `BitsAndBytesConfig(load_in_4bit=True)` instead of Unsloth, or try GGUF via `llama.cpp`.
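A sketch of the `transformers` fallback, using the same (unverified) repo name as above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_35b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-35B-A3B",  # verify exact HF repo name, as above
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer_35b = AutoTokenizer.from_pretrained("Qwen/Qwen3-35B-A3B")
```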
### Step 2.2: Run the Same Benchmark
```python
# Run identical benchmark on Qwen3-35B-A3B
run_benchmark(model_35b, tokenizer_35b, "benchmark/prompts.jsonl", "results/qwen3-35b.json")
```
**Critical:** Use the same system prompt, same temperature (0.1 for eval), same max_new_tokens. The only variable should be the model.
### Step 2.3: Optional β€” Add GPT-4o Baseline
```python
# For the strongest reference point
import json
import openai

def run_benchmark_api(prompts_path, output_path, model="gpt-4o"):
    with open(prompts_path, encoding="utf-8") as f:
        prompts = [json.loads(line) for line in f]
    results = []
    for prompt in prompts:
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt["system"]},
                {"role": "user", "content": prompt["prompt"]},
            ],
            temperature=0.1,
        )
        # ... score and collect exactly as in run_benchmark above
```
### Step 2.4: Build the Comparison Report
| Task | Qwen3-3.7B (base) | Tucano2-SFT | Tucano2-GRPO v2 | Qwen3-35B (zero-shot) | GPT-4o (zero-shot) |
|------|-------------------|-------------|-----------------|-----------------------|--------------------|
| Extraction | | | | | |
| Sentiment | | | | | |
| SQL | | | | | |
| Churn | | | | | |
| Insights | | | | | |
| **MEAN** | | | | | |
| Cost/query | $0.001 | $0.001 | $0.001 | $0.003 | $0.010 |
| Latency (s) | | | | | |
| Privacy | ✅ On-prem | ✅ On-prem | ✅ On-prem | ✅ On-prem | ❌ API |
**The story to tell:**
- If Tucano2-GRPO β‰₯ Qwen3-35B on extraction/sentiment/SQL: "Domain tuning eliminates the need for 10Γ— larger models"
- If Tucano2-GRPO < Qwen3-35B on insights: "Open-ended reasoning still benefits from scale, but structured tasks don't"
- Cost column proves the business case regardless of performance parity
---
## Phase 3: GRPO v3 Training Run
**Goal:** Fix entropy collapse, break through the v2 performance plateau.
**Time estimate:** 2-3 days (including data prep)
**Prerequisite:** Phase 1 (benchmark needed to measure improvement)
### Step 3.1: Expand Training Data (1000+ prompts)
**Target: 1000 prompts** (up from 300), stratified:
| Task | Current | Target | How to Expand |
|------|---------|--------|--------------|
| Extraction | ~75 | 300 | Generate from Olist dataset β€” sample reviews, create ground truth JSON |
| Sentiment | ~60 | 200 | Sample from B2W reviews corpus, label with existing model + human review |
| SQL | ~75 | 250 | Template-based: vary table names, WHERE clauses, aggregation patterns |
| Churn | ~45 | 100 | Augment customer profiles with synthetic variations |
| Insights | ~45 | 150 | LLM-generated analytical questions about e-commerce scenarios |
**Synthetic generation recipe (from Cocktail Effect paper):**
```python
# Use your SFT model or GPT-4o to generate new training prompts,
# then manually verify/correct the ground truth.
def generate_extraction_prompts(reviews_df, n=200):
    prompts = []
    for _, row in reviews_df.sample(n).iterrows():
        prompt = format_extraction_prompt(row["review_text"], row["score"], row["status"])
        # Generate ground truth with GPT-4o (higher quality than self-labeling);
        # gpt4o_extract() wraps the extraction call + JSON parsing
        ground_truth = gpt4o_extract(prompt)
        prompts.append({"prompt": prompt, "ground_truth": ground_truth})
    return prompts
```
**Also add 30% general reasoning data** (Cocktail Effect paper finding):
- Source: Portuguese subset of Orca-Math or translated OpenOrca
- Purpose: Regularization β€” prevents model from overfitting to domain patterns
- Mix ratio: 700 domain + 300 general = 1000 total (see the mixing sketch below)
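A minimal sketch of the 70/30 mix, assuming `domain_prompts` and `general_prompts` are the record lists produced by the generation recipes above; the output path is a placeholder:

```python
import json
import random

random.seed(42)  # reproducible split
mixed = random.sample(domain_prompts, 700) + random.sample(general_prompts, 300)
random.shuffle(mixed)  # interleave so every batch sees both distributions

with open("data/grpo_v3_train.jsonl", "w", encoding="utf-8") as f:
    for rec in mixed:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```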
### Step 3.2: Configuration Changes for v3
```python
# === GRPO v3 Config Changes ===
# FIX 1: Temperature β€” prevent entropy collapse
TEMPERATURE = 1.0 # Was 0.8 in v2. All published GRPO papers use 1.0.
# Reference: Skywork-OR1 (2505.22312) ablation shows Ο„=1.0 >> Ο„=0.6
# FIX 2: Completion length β€” remove the ceiling
MAX_COMPLETION_LENGTH = 4096 # Was 2048. Every v2 completion hit the ceiling.
# Trade-off: halve num_generations to fit VRAM
# FIX 3: Reduce generations to fit longer completions
NUM_GENERATIONS = 4 # Was 8. 4 Γ— 4096 β‰ˆ 8 Γ— 2048 in VRAM terms.
# MC-GRPO paper shows G=4 can work if using median baseline
# FIX 4: Explicit Ξ²=0 (no KL penalty)
# Dr. GRPO paper: Ξ²=0 is optimal for rule-based rewards
BETA = 0.0
# FIX 5: Learning rate β€” slightly more aggressive
LEARNING_RATE = 3e-6 # Was 2e-6. Clip ratios were all 0 β†’ room to push harder.
# Still well within published range (1e-6 to 5e-6)
# FIX 6: Max steps for expanded data
# 1000 prompts Γ— 4 generations / (4 batch Γ— 2 accum) = 500 steps/epoch
MAX_STEPS = 500
# FIX 7: Early stopping β€” more generous
EARLY_STOPPING_PATIENCE = 15 # 15 evals Γ— 10 steps = 150 steps of runway
EVAL_STEPS = 10
SAVE_STEPS = 10
```
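For orientation, here is how these constants might map onto TRL's `GRPOConfig`. A sketch: field names are from recent TRL releases, so verify them against your installed 0.24.0:

```python
from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="checkpoints/grpo-v3",  # hypothetical path
    temperature=TEMPERATURE,
    max_completion_length=MAX_COMPLETION_LENGTH,
    num_generations=NUM_GENERATIONS,
    beta=BETA,
    learning_rate=LEARNING_RATE,
    max_steps=MAX_STEPS,
    eval_strategy="steps",
    eval_steps=EVAL_STEPS,
    save_steps=SAVE_STEPS,
    scale_rewards=False,  # Dr. GRPO recommendation (see appendix)
)
```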
### Step 3.3: Add Entropy Monitoring (Critical for v3)
Since TRL 0.24.0 doesn't expose a native entropy bonus, add entropy monitoring via a callback:
```python
import wandb
from transformers import TrainerCallback

class EntropyMonitorCallback(TrainerCallback):
    """Monitor policy entropy to detect collapse early."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "train/completion_length" in logs:
            # Proxy for entropy: if all completions hit max length,
            # entropy is collapsing (model is stuck in one mode)
            completion_ratio = logs["train/completion_length"] / MAX_COMPLETION_LENGTH
            if completion_ratio > 0.95:
                print(f"⚠️ Step {state.global_step}: Completion ratio {completion_ratio:.2f} — "
                      f"possible entropy collapse. Monitor reward_std.")
            # Log to W&B
            if wandb.run:
                wandb.log({
                    "monitor/completion_ratio": completion_ratio,
                    "monitor/entropy_proxy": 1.0 - completion_ratio,
                }, step=state.global_step)
```
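Attach it through the standard `transformers` callback hook when building the trainer:

```python
trainer.add_callback(EntropyMonitorCallback())
```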
### Step 3.4: Add Zero-Advantage Group Filtering
```python
import random

# In the reward function wrapper or custom trainer:
def filtered_commerce_reward_fn(completions, prompts, **kwargs):
    """Compute rewards and flag zero-variance groups for filtering."""
    rewards = commerce_reward_fn(completions, prompts, **kwargs)
    # Group rewards by prompt (each prompt has NUM_GENERATIONS completions)
    for i in range(0, len(rewards), NUM_GENERATIONS):
        group = rewards[i:i + NUM_GENERATIONS]
        if max(group) - min(group) < 0.01:  # Near-zero variance
            # Add small noise to break the tie;
            # this prevents the GRPO advantage denominator from exploding
            for j in range(i, i + NUM_GENERATIONS):
                rewards[j] += random.gauss(0, 0.005)
    return rewards
```
**Note:** The proper fix is MC-GRPO's median baseline, but that requires modifying TRL internals. The noise injection is a pragmatic workaround for TRL 0.24.0.
### Step 3.5: Reward Function Refinement
Split the composite reward into staged components (Reasoning-SQL paper finding):
```python
def commerce_reward_fn_v3(completions, prompts, **kwargs):
    """Multi-component reward with staged convergence."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Stage 1: Format reward (converges first)
        r_format = score_format(completion)  # JSON valid? Think tags closed? Right structure?
        # Stage 2: Partial content reward
        r_partial = score_partial_content(completion, prompt)  # Some fields correct?
        # Stage 3: Full task reward
        r_task = score_full_task(completion, prompt)  # All fields correct? SQL executes?
        # Weighted combination. Fixed weights here; a decaying format
        # weight is sketched after this block.
        reward = 0.2 * r_format + 0.3 * r_partial + 0.5 * r_task
        rewards.append(reward)
    return rewards
```
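The fixed weights above don't yet realize the "format weight decreases over training" idea. One hedged way to schedule it, assuming `step` is tracked from trainer state (e.g. updated in a callback):

```python
def staged_weights(step: int, total_steps: int = 500) -> tuple[float, float, float]:
    """Linearly decay the format weight; the task weight absorbs the difference."""
    frac = min(step / total_steps, 1.0)
    w_format = 0.3 * (1.0 - frac) + 0.1 * frac  # 0.3 -> 0.1
    w_partial = 0.3                             # held constant
    w_task = 1.0 - w_format - w_partial         # 0.4 -> 0.6
    return w_format, w_partial, w_task
```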
### Step 3.6: VRAM Budget for v3
```
L4 GPU: 24GB total
Model (NF4): ~3.5GB
KV Cache (4096 seq): ~2.0GB
Activations: ~4.0GB
Optimizer states: ~3.0GB
Generations (4Γ—4096): ~8.0GB
─────────────────────────────
Estimated total: ~20.5GB
Headroom: ~3.5GB
βœ… Should fit. If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first.
```
### Step 3.7: Training Execution
```bash
# Pre-flight checklist:
# βœ… Benchmark built and baselines recorded (Phase 1)
# βœ… 1000+ prompts prepared and validated
# βœ… Config changes applied (temperature, completion length, generations, LR)
# βœ… Entropy monitor callback added
# βœ… Zero-advantage filtering active
# βœ… Reward function v3 with staged components
# βœ… FRESH=True (nuke old checkpoints)
# βœ… W&B run name: grpo-v3-l4-{timestamp}
# Expected runtime: 500 steps Γ— ~5 min/step (longer completions) β‰ˆ 42 hours
# Checkpoints every 10 steps
# Early stopping patience: 15 evals (150 steps)
```
### Step 3.8: Post-Training Validation
1. Run Phase 1 benchmark on GRPO v3 best checkpoint
2. Compare against all baselines:
- Qwen3-3.7B base β†’ Tucano2-SFT β†’ Tucano2-GRPO-v2 β†’ **Tucano2-GRPO-v3**
3. If v3 > v2 on benchmark: save as production model
4. If v3 β‰ˆ v2: entropy collapse not fully fixed; consider switching to MC-GRPO or upgrading GPU for longer completions
5. If v3 < v2: rollback, investigate β€” likely reward function regression
---
## Decision Criteria for Stopping
| Condition | Action |
|-----------|--------|
| v3 eval reward > 0.20 AND extraction score > 0.40 | Ship it β€” significantly better than v2 |
| v3 eval reward 0.15-0.20, improving trend | Run epoch 2 (extend MAX_STEPS to 1000) |
| v3 eval reward < v2 (0.125) | Stop, diagnose, review reward function and data |
| Entropy collapse again (clip_ratio=0 after step 50) | Add entropy bonus via custom loss (requires TRL fork) |
| OOM | Reduce MAX_COMPLETION_LENGTH to 3072 β†’ 2560 β†’ 2048 |
---
## Appendix: Literature References for Each Fix
| Fix | Paper | Section | Key Finding |
|-----|-------|---------|-------------|
| Temperature=1.0 | Skywork-OR1 (2505.22312) | Β§4, Table 3 | Ο„=1.0 gives 5-8% better performance, delays entropy collapse |
| Ξ²=0 (no KL) | Dr. GRPO (2503.20783) | Β§3.2 | KL penalty unnecessary for rule-based rewards |
| scale_rewards=False | Dr. GRPO (2503.20783) | Β§3.1 | Std normalization biases toward low-variance groups |
| Longer completions | Dr. GRPO (2503.20783) | Β§3.1 | GRPO length bias inflates wrong answers β†’ ceiling hit |
| Zero-advantage filtering | Skywork-OR1 (2505.22312) | Β§3.1 | Zero-std groups destabilize training |
| Staged rewards | Reasoning-SQL (2503.23157) | Β§3.2 | Format rewards converge first, enable task learning |
| General data mixing | Cocktail Effect (2410.01109) | Β§4 | 30% general data improves domain performance 2-15% |
| G=4 with median | MC-GRPO (2601.22582) | Β§3 | Median baseline reduces noise at small group sizes |