rtferraz committed · verified
Commit b47b36b · 1 Parent(s): aa71b0c

docs: add ADR-001 next steps with detailed execution plans

Files changed (1):
1. docs/ADR-001-next-steps.md (+517, -0)

docs/ADR-001-next-steps.md ADDED
@@ -0,0 +1,517 @@
# ADR-001: Tucano2-Commerce Next Steps – Detailed Execution Plans

**Status:** Accepted
**Date:** 2025-04-23
**Author:** Rafael Ferraz
**Context:** GRPO v2 training completed (210/300 steps, early stopped) with a mean validation reward of 0.54 (+42% vs. the SFT baseline). Three critical issues were diagnosed: entropy collapse, a completion-length ceiling, and insufficient data scale. This ADR details the execution plan for the next phase.

---

## Phase 1: Build the Domain Benchmark

**Goal:** Create a rigorous, reproducible evaluation suite that measures Tucano2 across all 5 task types.
**Time estimate:** 1-2 days
**Prerequisite:** None – can start immediately

### Step 1.1: Design the Benchmark Prompts

Create **80 held-out prompts** (never seen in training), stratified by task:

| Task | Count | Source | Difficulty Mix |
|------|-------|--------|----------------|
| Structured Extraction | 20 | Real reviews from Olist/B2W datasets | 10 easy (clear sentiment) + 10 hard (mixed/ambiguous) |
| Sentiment Analysis | 15 | Real reviews, balanced pos/neg/neutral | 5 per polarity |
| SQL Generation | 15 | Business questions against your e-commerce schema | 5 simple SELECT + 5 JOIN + 5 aggregate/window |
| Churn/Risk Prediction | 15 | Customer profiles with known outcomes | 5 low-risk + 5 medium + 5 high-risk |
| Business Insights | 15 | Open-ended analytical questions | 5 comparison + 5 trend + 5 recommendation |

**Implementation:**

```python
# File: benchmark/prompts.jsonl
# Each line is a JSON object:
{
  "id": "ext-001",
  "task": "extraction",
  "difficulty": "hard",
  "prompt": "Analise esta avaliação...",
  "system": "<your system prompt>",
  "ground_truth": {
    "sentiment": "negativo",
    "sentiment_score": -0.6,
    "churn_risk": 0.8,
    ...
  },
  "notes": "Mixed sentiment: product good but delivery terrible"
}
```

**Key principles:**
- Include edge cases: mixed sentiment, sarcasm, regional slang ("barato saiu caro"), incomplete reviews
- SQL prompts must have executable ground-truth queries against your actual schema (see the validation sketch below)
- Insights prompts should require multi-step reasoning, not one-hop lookups
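
A minimal pre-flight check for the second principle, assuming the benchmark schema lives in SQLite and each SQL prompt stores its ground-truth query under `ground_truth["sql"]` (both are assumptions, not fixed by this ADR):

```python
import json
import sqlite3

def validate_sql_ground_truth(prompts_path: str, db_path: str) -> list:
    """Return ids of SQL prompts whose ground-truth query fails to execute."""
    conn = sqlite3.connect(db_path)
    failures = []
    for line in open(prompts_path):
        prompt = json.loads(line)
        if prompt["task"] != "sql":
            continue
        try:
            conn.execute(prompt["ground_truth"]["sql"]).fetchall()
        except sqlite3.Error:
            failures.append(prompt["id"])
    conn.close()
    return failures
```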

### Step 1.2: Build Automated Scoring Functions

Each task type gets its own scorer:

#### Extraction Scorer
```python
from sentence_transformers import SentenceTransformer, util

# Any PT-BR-capable embedding model works here; this one is a placeholder choice.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def score_extraction(pred: dict, gt: dict) -> dict:
    """Score each of the 10 JSON fields independently."""
    scores = {}

    # Categorical fields: exact match
    for field in ["sentiment", "complaint_category"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Numeric fields: distance-based
    for field in ["sentiment_score", "churn_risk", "repeat_intent"]:
        scores[field] = max(0.0, 1.0 - abs(pred[field] - gt[field]))

    # Boolean fields: exact match
    for field in ["delivery_issue", "product_issue", "seller_issue", "would_recommend"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Text fields: semantic similarity (sentence-transformers)
    for field in ["main_complaint"]:
        emb_pred = _embedder.encode(pred[field], convert_to_tensor=True)
        emb_gt = _embedder.encode(gt[field], convert_to_tensor=True)
        scores[field] = float(util.cos_sim(emb_pred, emb_gt))

    scores["mean"] = sum(scores.values()) / len(scores)
    return scores
```

#### SQL Scorer
```python
def score_sql(predicted_sql: str, ground_truth_sql: str, db_connection) -> dict:
    """Execute both queries and compare result sets."""
    try:
        pred_results = db_connection.execute(predicted_sql).fetchall()
        gt_results = db_connection.execute(ground_truth_sql).fetchall()

        # Execution accuracy (EX): do the result sets match?
        ex = 1.0 if set(pred_results) == set(gt_results) else 0.0

        # Partial credit: row overlap
        overlap = len(set(pred_results) & set(gt_results))
        partial = overlap / max(len(gt_results), 1)

        return {"ex": ex, "partial": partial, "syntax_valid": 1.0}
    except Exception:
        return {"ex": 0.0, "partial": 0.0, "syntax_valid": 0.0}
```

#### Sentiment Scorer
```python
def score_sentiment(pred: dict, gt: dict) -> dict:
    """Exact match on polarity, distance on score (scores live in [-1, 1])."""
    polarity_match = 1.0 if pred["polarity"] == gt["polarity"] else 0.0
    score_distance = max(0.0, 1.0 - abs(pred["score"] - gt["score"]) / 2.0)
    return {"polarity": polarity_match, "score": score_distance}
```

#### Churn Scorer
```python
def score_churn(prediction: float, ground_truth: float, threshold: float = 0.5) -> dict:
    """Binary accuracy + calibration."""
    binary = 1.0 if (prediction >= threshold) == (ground_truth >= threshold) else 0.0
    calibration = max(0.0, 1.0 - abs(prediction - ground_truth))
    return {"binary_accuracy": binary, "calibration": calibration}
```

#### Insights Scorer (LLM-as-Judge)
```python
import json
import openai

JUDGE_PROMPT = """You are evaluating a Brazilian e-commerce analysis response.
Rate on 4 dimensions (1-5 each):
1. Relevance: Does it address the question?
2. Accuracy: Are the claims factually reasonable?
3. Actionability: Could a business act on this analysis?
4. Portuguese Quality: Is the PT-BR natural and professional?

Response to evaluate:
{response}

Question asked:
{question}

Return JSON: {{"relevance": X, "accuracy": X, "actionability": X, "portuguese": X}}"""
# Note: the literal braces in the JSON template are doubled so str.format()
# doesn't treat them as placeholders.

def score_insights(response: str, question: str) -> dict:
    """Use GPT-4o as judge, run 3x and average."""
    scores = []
    for _ in range(3):
        result = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                response=response, question=question
            )}],
            temperature=0.3,
        )
        scores.append(json.loads(result.choices[0].message.content))

    # Average across 3 runs
    return {k: sum(s[k] for s in scores) / 3 for k in scores[0]}
```
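
If the judge intermittently returns non-JSON, one option is to request JSON mode on the chat completions call (`response_format={"type": "json_object"}`), which makes the `json.loads` step more reliable; JSON mode requires the word "JSON" to appear in the prompt, which `JUDGE_PROMPT` already satisfies.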

### Step 1.3: Run the Benchmark Script

```python
# File: benchmark/run_benchmark.py
import json

def mean_of(scores: dict) -> float:
    """Scorers return different keys; use their own "mean" if present,
    otherwise average all values."""
    return scores.get("mean", sum(scores.values()) / len(scores))

def run_benchmark(model, tokenizer, prompts_path, output_path):
    prompts = [json.loads(line) for line in open(prompts_path)]
    results = []

    for prompt in prompts:
        # Generate response (generate() and parse_response() are shared project helpers)
        messages = [
            {"role": "system", "content": prompt["system"]},
            {"role": "user", "content": prompt["prompt"]},
        ]
        response = generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1)

        # Score based on task type. Note: the SQL scorer also needs a db_connection
        # and the insights scorer needs the question, so dispatch accordingly.
        scorer = SCORERS[prompt["task"]]
        score = scorer(parse_response(response), prompt.get("ground_truth"))

        results.append({
            "id": prompt["id"],
            "task": prompt["task"],
            "difficulty": prompt["difficulty"],
            "response": response,
            "scores": score,
            "tokens": len(tokenizer.encode(response)),
        })

    # Save results
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Print summary
    for task in ["extraction", "sentiment", "sql", "churn", "insights"]:
        task_results = [r for r in results if r["task"] == task]
        mean_score = sum(mean_of(r["scores"]) for r in task_results) / len(task_results)
        print(f"{task}: {mean_score:.3f} (n={len(task_results)})")

SCORERS = {
    "extraction": score_extraction,
    "sentiment": score_sentiment,
    "sql": score_sql,
    "churn": score_churn,
    "insights": score_insights,
}
```

### Step 1.4: Establish Baselines

Run the benchmark on:
1. **Qwen3-3.7B base** (zero-shot, no adapters) – this is your floor
2. **Tucano2-SFT** (SFT adapter only) – this is your pre-GRPO baseline
3. **Tucano2-GRPO v2** (current best) – this is what you're improving

Record all results in a comparison table.
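
A driver sketch for the three runs; `load_model()` and the adapter paths are placeholders for however the checkpoints are actually stored:

```python
CHECKPOINTS = {
    "qwen3-base": None,                      # no adapter
    "tucano2-sft": "adapters/sft",           # illustrative paths
    "tucano2-grpo-v2": "adapters/grpo-v2",
}

for name, adapter in CHECKPOINTS.items():
    model, tokenizer = load_model(adapter)   # hypothetical loader
    run_benchmark(model, tokenizer, "benchmark/prompts.jsonl", f"results/{name}.json")
```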

---

## Phase 2: Run Comparison vs. Qwen3-35B-A3B

**Goal:** Prove domain-tuned 3.7B matches or beats general 35B on e-commerce tasks.
**Time estimate:** 1 day (after benchmark is built)
**Prerequisite:** Phase 1 complete

### Step 2.1: Set Up Qwen3-35B-A3B

This is a Mixture-of-Experts model (35B total, ~3B active per token). It should fit on your L4 with 4-bit quantization.

```python
from unsloth import FastLanguageModel

model_35b, tokenizer_35b = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-35B-A3B",  # Verify exact HF repo name
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
FastLanguageModel.for_inference(model_35b)
```

**Memory estimate:** 35B params × 0.5 bytes (4-bit) ≈ 17.5GB. Active inference ~3B ≈ 3GB compute. Should fit on the L4 (24GB) with headroom.

**If it doesn't fit:** Use `transformers` with `BitsAndBytesConfig(load_in_4bit=True)` instead of Unsloth (sketched below), or try GGUF via `llama.cpp`.
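
A minimal version of that fallback, assuming plain `transformers` + `bitsandbytes` with NF4 quantization (the repo name still needs the verification flagged above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_35b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-35B-A3B",  # verify exact HF repo name
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer_35b = AutoTokenizer.from_pretrained("Qwen/Qwen3-35B-A3B")
```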

### Step 2.2: Run the Same Benchmark

```python
# Run identical benchmark on Qwen3-35B-A3B
run_benchmark(model_35b, tokenizer_35b, "benchmark/prompts.jsonl", "results/qwen3-35b.json")
```

**Critical:** Use the same system prompt, same temperature (0.1 for eval), same max_new_tokens. The only variable should be the model.
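
One way to enforce that is a single decoding config reused for every model; `GEN_KWARGS` is our name for it, and `generate()` is the shared helper assumed above:

```python
# Single source of truth for eval-time decoding; every model gets the same values.
GEN_KWARGS = {"max_new_tokens": 2048, "temperature": 0.1}

response = generate(model_35b, tokenizer_35b, messages, **GEN_KWARGS)
```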

### Step 2.3: Optional – Add GPT-4o Baseline

```python
# For the strongest reference point
import json
import openai

def run_benchmark_api(prompts_path, output_path, model="gpt-4o"):
    prompts = [json.loads(line) for line in open(prompts_path)]
    results = []

    for prompt in prompts:
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt["system"]},
                {"role": "user", "content": prompt["prompt"]},
            ],
            temperature=0.1,
        )
        # ... score same as above
```

### Step 2.4: Build the Comparison Report

| Task | Qwen3-3.7B (base) | Tucano2-SFT | Tucano2-GRPO v2 | Qwen3-35B (zero-shot) | GPT-4o (zero-shot) |
|------|-------------------|-------------|-----------------|-----------------------|--------------------|
| Extraction | | | | | |
| Sentiment | | | | | |
| SQL | | | | | |
| Churn | | | | | |
| Insights | | | | | |
| **MEAN** | | | | | |
| Cost/query | $0.001 | $0.001 | $0.001 | $0.003 | $0.010 |
| Latency (s) | | | | | |
| Privacy | ✅ On-prem | ✅ On-prem | ✅ On-prem | ✅ On-prem | ❌ API |

**The story to tell:**
- If Tucano2-GRPO ≥ Qwen3-35B on extraction/sentiment/SQL: "Domain tuning eliminates the need for 10× larger models"
- If Tucano2-GRPO < Qwen3-35B on insights: "Open-ended reasoning still benefits from scale, but structured tasks don't"
- The cost row proves the business case regardless of performance parity

---

## Phase 3: GRPO v3 Training Run

**Goal:** Fix entropy collapse, break through the v2 performance plateau.
**Time estimate:** 2-3 days (including data prep)
**Prerequisite:** Phase 1 (benchmark needed to measure improvement)

### Step 3.1: Expand Training Data (1000+ prompts)

**Target: 1000 prompts** (up from 300), stratified:

| Task | Current | Target | How to Expand |
|------|---------|--------|---------------|
| Extraction | ~75 | 300 | Generate from the Olist dataset: sample reviews, create ground-truth JSON |
| Sentiment | ~60 | 200 | Sample from the B2W reviews corpus, label with existing model + human review |
| SQL | ~75 | 250 | Template-based: vary table names, WHERE clauses, aggregation patterns |
| Churn | ~45 | 100 | Augment customer profiles with synthetic variations |
| Insights | ~45 | 150 | LLM-generated analytical questions about e-commerce scenarios |

**Synthetic generation recipe (from the Cocktail Effect paper):**
```python
# Use your SFT model or GPT-4o to generate new training prompts,
# then manually verify/correct the ground truth.
# format_extraction_prompt() and gpt4o_extract() are project helpers.
def generate_extraction_prompts(reviews_df, n=200):
    prompts = []
    for _, row in reviews_df.sample(n).iterrows():
        prompt = format_extraction_prompt(row["review_text"], row["score"], row["status"])
        # Generate ground truth with GPT-4o (higher quality than self-labeling)
        ground_truth = gpt4o_extract(prompt)
        prompts.append({"prompt": prompt, "ground_truth": ground_truth})
    return prompts
```

**Also add 30% general reasoning data** (Cocktail Effect paper finding):
- Source: Portuguese subset of Orca-Math or translated OpenOrca
- Purpose: Regularization; prevents the model from overfitting to domain patterns
- Mix ratio: 700 domain + 300 general = 1000 total (see the mixing sketch below)
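
A minimal mixing sketch; the file names are illustrative:

```python
import json
import random

domain = [json.loads(l) for l in open("data/domain_prompts.jsonl")]
general = [json.loads(l) for l in open("data/general_reasoning_pt.jsonl")]

random.seed(42)
mixed = random.sample(domain, 700) + random.sample(general, 300)
random.shuffle(mixed)

with open("data/grpo_v3_train.jsonl", "w") as f:
    for row in mixed:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```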

### Step 3.2: Configuration Changes for v3

```python
# === GRPO v3 Config Changes ===

# FIX 1: Temperature - prevent entropy collapse
TEMPERATURE = 1.0  # Was 0.8 in v2. The GRPO papers cited in the appendix use 1.0.
# Reference: Skywork-OR1 (2505.22312) ablation shows τ=1.0 >> τ=0.6

# FIX 2: Completion length - remove the ceiling
MAX_COMPLETION_LENGTH = 4096  # Was 2048. Every v2 completion hit the ceiling.
# Trade-off: halve num_generations to fit VRAM

# FIX 3: Reduce generations to fit longer completions
NUM_GENERATIONS = 4  # Was 8. 4 × 4096 ≈ 8 × 2048 in VRAM terms.
# MC-GRPO paper shows G=4 can work if using median baseline

# FIX 4: Explicit β=0 (no KL penalty)
# Dr. GRPO paper: β=0 is optimal for rule-based rewards
BETA = 0.0

# FIX 5: Learning rate - slightly more aggressive
LEARNING_RATE = 3e-6  # Was 2e-6. Clip ratios were all 0 → room to push harder.
# Still well within published range (1e-6 to 5e-6)

# FIX 6: Max steps for expanded data
# 1000 prompts × 4 generations / (4 batch × 2 accum) = 500 steps/epoch
MAX_STEPS = 500

# FIX 7: Early stopping - more generous
EARLY_STOPPING_PATIENCE = 15  # 15 evals × 10 steps = 150 steps of runway
EVAL_STEPS = 10
SAVE_STEPS = 10
```
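
For orientation, a sketch of how these constants would plug into TRL's `GRPOConfig`; the field names are believed current for TRL 0.24.0 but should be double-checked against the installed version:

```python
from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="checkpoints/grpo-v3",          # illustrative path
    learning_rate=LEARNING_RATE,
    temperature=TEMPERATURE,
    max_completion_length=MAX_COMPLETION_LENGTH,
    num_generations=NUM_GENERATIONS,
    beta=BETA,
    scale_rewards=False,                        # per Dr. GRPO (see appendix)
    max_steps=MAX_STEPS,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    eval_strategy="steps",
    eval_steps=EVAL_STEPS,
    save_steps=SAVE_STEPS,
)
```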

### Step 3.3: Add Entropy Monitoring (Critical for v3)

Since TRL 0.24.0 has no native entropy bonus, at minimum monitor a collapse proxy via a callback:

```python
import wandb
from transformers import TrainerCallback

class EntropyMonitorCallback(TrainerCallback):
    """Monitor policy entropy to detect collapse early."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "train/completion_length" in logs:
            # Proxy for entropy: if all completions are max length,
            # entropy is collapsing (model is stuck in one mode)
            completion_ratio = logs["train/completion_length"] / MAX_COMPLETION_LENGTH

            if completion_ratio > 0.95:
                print(f"⚠️ Step {state.global_step}: Completion ratio {completion_ratio:.2f}; "
                      f"possible entropy collapse. Monitor reward_std.")

            # Log to W&B
            if wandb.run:
                wandb.log({
                    "monitor/completion_ratio": completion_ratio,
                    "monitor/entropy_proxy": 1.0 - completion_ratio,
                }, step=state.global_step)
```
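
Register it like any Transformers callback: either via the trainer constructor (`callbacks=[EntropyMonitorCallback()]`) or afterwards with `trainer.add_callback(EntropyMonitorCallback())`.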

### Step 3.4: Add Zero-Advantage Group Filtering

```python
import random

# In the reward function wrapper or custom trainer:
def filtered_commerce_reward_fn(completions, prompts, **kwargs):
    """Compute rewards and flag zero-variance groups for filtering."""
    rewards = commerce_reward_fn(completions, prompts, **kwargs)

    # Group rewards by prompt (each prompt has NUM_GENERATIONS completions)
    for i in range(0, len(rewards), NUM_GENERATIONS):
        group = rewards[i:i + NUM_GENERATIONS]
        if max(group) - min(group) < 0.01:  # Near-zero variance
            # Add small noise to break the tie.
            # This prevents the GRPO denominator from exploding.
            for j in range(i, i + NUM_GENERATIONS):
                rewards[j] += random.gauss(0, 0.005)

    return rewards
```

**Note:** The proper fix is MC-GRPO's median baseline (illustrated below), but that requires modifying TRL internals. The noise injection is a pragmatic workaround for TRL 0.24.0.
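
For intuition only, a standalone sketch of what the median baseline changes; this is not wired into TRL, and the function name is ours:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, baseline: str = "mean") -> np.ndarray:
    """Group-relative advantages for one prompt's G completions.

    Standard GRPO subtracts the group mean; MC-GRPO's variant subtracts the
    median, which is more robust to a single outlier reward at small G.
    (No std division here, matching scale_rewards=False.)
    """
    center = np.median(rewards) if baseline == "median" else rewards.mean()
    return rewards - center

# One lucky completion in a G=4 group:
group = np.array([0.1, 0.1, 0.1, 0.9])
print(grpo_advantages(group, "mean"))    # [-0.2 -0.2 -0.2  0.6]
print(grpo_advantages(group, "median"))  # [ 0.   0.   0.   0.8]
```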

### Step 3.5: Reward Function Refinement

Split the composite reward into staged components (Reasoning-SQL paper finding):

```python
def commerce_reward_fn_v3(completions, prompts, **kwargs):
    """Multi-component reward with staged convergence."""
    # score_format / score_partial_content / score_full_task are project scorers.
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Stage 1: Format reward (converges first)
        r_format = score_format(completion)  # JSON valid? Think tags closed? Right structure?

        # Stage 2: Partial content reward
        r_partial = score_partial_content(completion, prompt)  # Some fields correct?

        # Stage 3: Full task reward
        r_task = score_full_task(completion, prompt)  # All fields correct? SQL executes?

        # Weighted combination. Weights are fixed here; annealing the format
        # weight down over training is sketched below.
        reward = 0.2 * r_format + 0.3 * r_partial + 0.5 * r_task
        rewards.append(reward)

    return rewards
```
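
A possible annealing schedule for those weights; the 0.4 → 0.1 range is illustrative, not tuned:

```python
def staged_weights(step: int, max_steps: int = 500) -> tuple:
    """Shift credit from format to full task as training progresses."""
    progress = min(step / max_steps, 1.0)
    w_format = 0.4 * (1.0 - progress) + 0.1 * progress   # 0.4 -> 0.1
    w_partial = 0.3                                       # held constant
    w_task = 1.0 - w_format - w_partial                   # 0.3 -> 0.6
    return w_format, w_partial, w_task
```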

### Step 3.6: VRAM Budget for v3

```
L4 GPU: 24GB total

Model (NF4):             ~3.5GB
KV cache (4096 seq):     ~2.0GB
Activations:             ~4.0GB
Optimizer states:        ~3.0GB
Generations (4 × 4096):  ~8.0GB
───────────────────────────────
Estimated total:        ~20.5GB
Headroom:                ~3.5GB

✅ Should fit. If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first.
```
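
A quick way to check the estimate against reality before committing to the full run, using PyTorch's standard CUDA memory counters:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a handful of training steps at the v3 settings (4 generations × 4096 tokens) ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB (estimate: 20.5 GB, hard limit: 24 GB)")
```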

### Step 3.7: Training Execution

```bash
# Pre-flight checklist:
# ✅ Benchmark built and baselines recorded (Phase 1)
# ✅ 1000+ prompts prepared and validated
# ✅ Config changes applied (temperature, completion length, generations, LR)
# ✅ Entropy monitor callback added
# ✅ Zero-advantage filtering active
# ✅ Reward function v3 with staged components
# ✅ FRESH=True (nuke old checkpoints)
# ✅ W&B run name: grpo-v3-l4-{timestamp}

# Expected runtime: 500 steps × ~5 min/step (longer completions) ≈ 42 hours
# Checkpoints every 10 steps
# Early stopping patience: 15 evals (150 steps)
```

### Step 3.8: Post-Training Validation

1. Run the Phase 1 benchmark on the best GRPO v3 checkpoint
2. Compare against all baselines:
   - Qwen3-3.7B base → Tucano2-SFT → Tucano2-GRPO-v2 → **Tucano2-GRPO-v3**
3. If v3 > v2 on the benchmark: save as the production model
4. If v3 ≈ v2: entropy collapse not fully fixed; consider switching to MC-GRPO or upgrading the GPU for longer completions
5. If v3 < v2: roll back and investigate; the likely culprit is a reward-function regression

---

## Decision Criteria for Stopping

| Condition | Action |
|-----------|--------|
| v3 eval reward > 0.20 AND extraction score > 0.40 | Ship it; significantly better than v2 |
| v3 eval reward 0.15-0.20, improving trend | Run epoch 2 (extend MAX_STEPS to 1000) |
| v3 eval reward < v2 (0.125) | Stop, diagnose, review reward function and data |
| Entropy collapse again (clip_ratio=0 after step 50) | Add entropy bonus via custom loss (requires TRL fork) |
| OOM | Reduce MAX_COMPLETION_LENGTH to 3072 → 2560 → 2048 |

---

## Appendix: Literature References for Each Fix

| Fix | Paper | Section | Key Finding |
|-----|-------|---------|-------------|
| Temperature=1.0 | Skywork-OR1 (2505.22312) | §4, Table 3 | τ=1.0 gives 5-8% better performance, delays entropy collapse |
| β=0 (no KL) | Dr. GRPO (2503.20783) | §3.2 | KL penalty unnecessary for rule-based rewards |
| scale_rewards=False | Dr. GRPO (2503.20783) | §3.1 | Std normalization biases toward low-variance groups |
| Longer completions | Dr. GRPO (2503.20783) | §3.1 | GRPO length bias inflates wrong answers → ceiling hit |
| Zero-advantage filtering | Skywork-OR1 (2505.22312) | §3.1 | Zero-std groups destabilize training |
| Staged rewards | Reasoning-SQL (2503.23157) | §3.2 | Format rewards converge first, enable task learning |
| General data mixing | Cocktail Effect (2410.01109) | §4 | 30% general data improves domain performance 2-15% |
| G=4 with median | MC-GRPO (2601.22582) | §3 | Median baseline reduces noise at small group sizes |