# ADR-001: Tucano2-Commerce Next Steps β€” Detailed Execution Plans

**Status:** Accepted  
**Date:** 2025-04-23  
**Author:** Rafael Ferraz  
**Context:** GRPO v2 training completed (210/300 steps, early stopped). Mean validation reward 0.54 (+42% vs SFT baseline). Three critical issues diagnosed: entropy collapse, completion length ceiling, data scale. This ADR details the execution plan for the next phase.

---

## Phase 1: Build the Domain Benchmark

**Goal:** Create a rigorous, reproducible evaluation suite that measures Tucano2 across all 5 task types.  
**Time estimate:** 1-2 days  
**Prerequisite:** None β€” can start immediately

### Step 1.1: Design the Benchmark Prompts

Create **80 held-out prompts** (never seen in training), stratified by task:

| Task | Count | Source | Difficulty Mix |
|------|-------|--------|---------------|
| Structured Extraction | 20 | Real reviews from Olist/B2W datasets | 10 easy (clear sentiment) + 10 hard (mixed/ambiguous) |
| Sentiment Analysis | 15 | Real reviews, balanced pos/neg/neutral | 5 per polarity |
| SQL Generation | 15 | Business questions against your e-commerce schema | 5 simple SELECT + 5 JOIN + 5 aggregate/window |
| Churn/Risk Prediction | 15 | Customer profiles with known outcomes | 5 low-risk + 5 medium + 5 high-risk |
| Business Insights | 15 | Open-ended analytical questions | 5 comparison + 5 trend + 5 recommendation |

**Implementation:**

```python
# File: benchmark/prompts.jsonl
# Each line is one JSON object (pretty-printed here for readability):
{
    "id": "ext-001",
    "task": "extraction",
    "difficulty": "hard",
    "prompt": "Analise esta avaliaΓ§Γ£o...",
    "system": "<your system prompt>",
    "ground_truth": {
        "sentiment": "negativo",
        "sentiment_score": -0.6,
        "churn_risk": 0.8,
        ...
    },
    "notes": "Mixed sentiment β€” product good but delivery terrible"
}
```

**Key principles:**
- Include edge cases: mixed sentiment, sarcasm, regional slang ("barato saiu caro", i.e. "the cheap option turned out expensive"), incomplete reviews
- SQL prompts must have executable ground truth queries against your actual schema
- Insights prompts should require multi-step reasoning, not one-hop lookups

### Step 1.2: Build Automated Scoring Functions

Each task type gets its own scorer:

#### Extraction Scorer
```python
from sentence_transformers import SentenceTransformer, util

# Multilingual model so PT-BR complaint text embeds sensibly
_embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def score_extraction(pred: dict, gt: dict) -> dict:
    """Score each of the 10 JSON fields independently."""
    scores = {}

    # Categorical fields: exact match
    for field in ["sentiment", "complaint_category"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Numeric fields: distance-based
    for field in ["sentiment_score", "churn_risk", "repeat_intent"]:
        scores[field] = max(0.0, 1.0 - abs(pred[field] - gt[field]))

    # Boolean fields: exact match
    for field in ["delivery_issue", "product_issue", "seller_issue", "would_recommend"]:
        scores[field] = 1.0 if pred[field] == gt[field] else 0.0

    # Text fields: semantic similarity (sentence-transformers)
    for field in ["main_complaint"]:
        emb_pred = _embedder.encode(pred[field], convert_to_tensor=True)
        emb_gt = _embedder.encode(gt[field], convert_to_tensor=True)
        scores[field] = float(util.cos_sim(emb_pred, emb_gt))

    scores["mean"] = sum(scores.values()) / len(scores)
    return scores
```

#### SQL Scorer
```python
def score_sql(predicted_sql: str, ground_truth_sql: str, db_connection) -> dict:
    """Execute both queries and compare result sets."""
    # Ground-truth queries are pre-validated, so run them outside the try
    # block: a failure here is a benchmark bug, not a model error
    gt_results = db_connection.execute(ground_truth_sql).fetchall()
    try:
        pred_results = db_connection.execute(predicted_sql).fetchall()
    except Exception:
        return {"ex": 0.0, "partial": 0.0, "syntax_valid": 0.0}

    # Execution accuracy (EX): do the result sets match?
    ex = 1.0 if set(pred_results) == set(gt_results) else 0.0

    # Partial credit: row overlap
    overlap = len(set(pred_results) & set(gt_results))
    partial = overlap / max(len(gt_results), 1)

    return {"ex": ex, "partial": partial, "syntax_valid": 1.0}
```

#### Sentiment Scorer
```python
def score_sentiment(pred: dict, gt: dict) -> dict:
    """Exact match on polarity, distance on score (scores live in [-1, 1])."""
    polarity_match = 1.0 if pred["polarity"] == gt["polarity"] else 0.0
    score_distance = max(0.0, 1.0 - abs(pred["score"] - gt["score"]) / 2.0)
    return {"polarity": polarity_match, "score": score_distance}
```

#### Churn Scorer
```python
def score_churn(prediction: float, ground_truth: float, threshold=0.5) -> dict:
    """Binary accuracy + calibration."""
    binary = 1.0 if (prediction >= threshold) == (ground_truth >= threshold) else 0.0
    calibration = max(0, 1.0 - abs(prediction - ground_truth))
    return {"binary_accuracy": binary, "calibration": calibration}
```

#### Insights Scorer (LLM-as-Judge)
```python
import json
import openai

JUDGE_PROMPT = """You are evaluating a Brazilian e-commerce analysis response.
Rate on 4 dimensions (1-5 each):
1. Relevance: Does it address the question?
2. Accuracy: Are the claims factually reasonable?
3. Actionability: Could a business act on this analysis?
4. Portuguese Quality: Is the PT-BR natural and professional?

Response to evaluate:
{response}

Question asked:
{question}

Return JSON: {{"relevance": X, "accuracy": X, "actionability": X, "portuguese": X}}"""
# Braces in the literal JSON template are doubled so that str.format()
# only substitutes {response} and {question}

def score_insights(response: str, question: str) -> dict:
    """Use GPT-4o as judge, run 3x and average."""
    scores = []
    for _ in range(3):
        result = openai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                response=response, question=question
            )}],
            temperature=0.3,
            response_format={"type": "json_object"},  # guarantee parseable JSON
        )
        scores.append(json.loads(result.choices[0].message.content))

    # Average across the 3 runs
    return {k: sum(s[k] for s in scores) / len(scores) for k in scores[0]}
```

### Step 1.3: Run the Benchmark Script

```python
# File: benchmark/run_benchmark.py
import json

def run_benchmark(model, tokenizer, prompts_path, output_path):
    # generate() and parse_response() are project helpers: chat-template
    # generation and JSON/SQL extraction from the raw completion
    with open(prompts_path) as f:
        prompts = [json.loads(line) for line in f]
    results = []

    for prompt in prompts:
        # Generate response
        messages = [
            {"role": "system", "content": prompt["system"]},
            {"role": "user", "content": prompt["prompt"]}
        ]
        response = generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1)

        # Score based on task type. Insights is judged on the raw response;
        # the other scorers compare parsed output against ground truth.
        if prompt["task"] == "insights":
            score = score_insights(response, prompt["prompt"])
        else:
            score = SCORERS[prompt["task"]](parse_response(response), prompt["ground_truth"])

        # Not every scorer returns a "mean" key β€” derive one uniformly
        if "mean" not in score:
            score["mean"] = sum(score.values()) / len(score)

        results.append({
            "id": prompt["id"],
            "task": prompt["task"],
            "difficulty": prompt["difficulty"],
            "response": response,
            "scores": score,
            "tokens": len(tokenizer.encode(response))
        })

    # Save results
    with open(output_path, "w") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    # Print per-task summary
    for task in ["extraction", "sentiment", "sql", "churn", "insights"]:
        task_results = [r for r in results if r["task"] == task]
        mean_score = sum(r["scores"]["mean"] for r in task_results) / len(task_results)
        print(f"{task}: {mean_score:.3f} (n={len(task_results)})")

SCORERS = {
    "extraction": score_extraction,
    "sentiment": score_sentiment,
    "sql": lambda pred, gt: score_sql(pred, gt, db_connection),  # db_connection: eval DB handle created elsewhere
    "churn": score_churn,
}
```
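
`generate()` and `parse_response()` are left to the project; a minimal `generate()` sketch using the HF chat template (the signature matches the call above, but the implementation details here are assumptions):

```python
import torch

def generate(model, tokenizer, messages, max_new_tokens=2048, temperature=0.1):
    """Minimal chat-template generation helper (sketch)."""
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
        )
    # Return only the newly generated tokens
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
```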

### Step 1.4: Establish Baselines

Run the benchmark on:
1. **Qwen3-3.7B base** (zero-shot, no adapters) β€” this is your floor
2. **Tucano2-SFT** (SFT adapter only) β€” this is your pre-GRPO baseline
3. **Tucano2-GRPO v2** (current best) β€” this is what you're improving

Record all results in a comparison table.
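
A small aggregation helper can fill that table from the saved result files (file names are illustrative):

```python
import json

def summarize(results_path: str) -> dict:
    """Per-task mean of the 'mean' score from one saved benchmark run."""
    with open(results_path) as f:
        results = json.load(f)
    by_task = {}
    for r in results:
        by_task.setdefault(r["task"], []).append(r["scores"]["mean"])
    return {task: sum(v) / len(v) for task, v in by_task.items()}

# Illustrative file names for the three baseline runs
for name, path in {
    "qwen3-base": "results/qwen3-base.json",
    "tucano2-sft": "results/tucano2-sft.json",
    "tucano2-grpo-v2": "results/tucano2-grpo-v2.json",
}.items():
    print(name, summarize(path))
```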

---

## Phase 2: Run Comparison vs. Qwen3-35B-A3B

**Goal:** Prove domain-tuned 3.7B matches or beats general 35B on e-commerce tasks.  
**Time estimate:** 1 day (after benchmark is built)  
**Prerequisite:** Phase 1 complete

### Step 2.1: Set Up Qwen3-35B-A3B

This is a Mixture-of-Experts model (35B total, ~3B active per token). It should fit on your L4 with 4-bit quantization.

```python
from unsloth import FastLanguageModel

model_35b, tokenizer_35b = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-35B-A3B",  # Verify exact HF repo name
    max_seq_length=4096,
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
FastLanguageModel.for_inference(model_35b)
```

**Memory estimate:** 35B params Γ— 0.5 bytes (4-bit) β‰ˆ 17.5GB. Active inference ~3B β‰ˆ 3GB compute. Should fit on L4 (24GB) with headroom.

**If it doesn't fit:** Use `transformers` with `BitsAndBytesConfig(load_in_4bit=True)` instead of Unsloth, or try GGUF via `llama.cpp`.
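
A sketch of that `transformers` fallback (same caveat about verifying the exact repo name):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 matches the memory estimate above
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_35b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-35B-A3B",                # verify exact repo name, as noted above
    quantization_config=bnb,
    device_map="auto",
)
tokenizer_35b = AutoTokenizer.from_pretrained("Qwen/Qwen3-35B-A3B")
```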

### Step 2.2: Run the Same Benchmark

```python
# Run identical benchmark on Qwen3-35B-A3B
run_benchmark(model_35b, tokenizer_35b, "benchmark/prompts.jsonl", "results/qwen3-35b.json")
```

**Critical:** Use the same system prompt, same temperature (0.1 for eval), same max_new_tokens. The only variable should be the model.

### Step 2.3: Optional β€” Add GPT-4o Baseline

```python
# For the strongest reference point
import openai

def run_benchmark_api(prompts_path, output_path, model="gpt-4o"):
    prompts = [json.loads(line) for line in open(prompts_path)]
    results = []
    
    for prompt in prompts:
        response = openai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": prompt["system"]},
                {"role": "user", "content": prompt["prompt"]}
            ],
            temperature=0.1
        )
        # ... score same as above
```

### Step 2.4: Build the Comparison Report

| Task | Qwen3-3.7B (base) | Tucano2-SFT | Tucano2-GRPO v2 | Qwen3-35B (zero-shot) | GPT-4o (zero-shot) |
|------|-------------------|-------------|-----------------|-----------------------|--------------------|
| Extraction | | | | | |
| Sentiment | | | | | |
| SQL | | | | | |
| Churn | | | | | |
| Insights | | | | | |
| **MEAN** | | | | | |
| Cost/query | $0.001 | $0.001 | $0.001 | $0.003 | $0.010 |
| Latency (s) | | | | | |
| Privacy | βœ… On-prem | βœ… On-prem | βœ… On-prem | βœ… On-prem | ❌ API |

**The story to tell:**
- If Tucano2-GRPO β‰₯ Qwen3-35B on extraction/sentiment/SQL: "Domain tuning eliminates the need for 10Γ— larger models"
- If Tucano2-GRPO < Qwen3-35B on insights: "Open-ended reasoning still benefits from scale, but structured tasks don't"
- Cost column proves the business case regardless of performance parity

---

## Phase 3: GRPO v3 Training Run

**Goal:** Fix entropy collapse, break through the v2 performance plateau.  
**Time estimate:** 2-3 days (including data prep)  
**Prerequisite:** Phase 1 (benchmark needed to measure improvement)

### Step 3.1: Expand Training Data (1000+ prompts)

**Target: 1000 prompts** (up from 300), stratified:

| Task | Current | Target | How to Expand |
|------|---------|--------|--------------|
| Extraction | ~75 | 300 | Generate from Olist dataset β€” sample reviews, create ground truth JSON |
| Sentiment | ~60 | 200 | Sample from B2W reviews corpus, label with existing model + human review |
| SQL | ~75 | 250 | Template-based: vary table names, WHERE clauses, aggregation patterns |
| Churn | ~45 | 100 | Augment customer profiles with synthetic variations |
| Insights | ~45 | 150 | LLM-generated analytical questions about e-commerce scenarios |

**Synthetic generation recipe (from Cocktail Effect paper):**
```python
# Use your SFT model or GPT-4o to generate new training prompts
# Then manually verify/correct the ground truth
def generate_extraction_prompts(reviews_df, n=200):
    prompts = []
    for _, row in reviews_df.sample(n).iterrows():
        prompt = format_extraction_prompt(row["review_text"], row["score"], row["status"])
        # Generate ground truth with GPT-4o (higher quality than self-labeling)
        ground_truth = gpt4o_extract(prompt)
        prompts.append({"prompt": prompt, "ground_truth": ground_truth})
    return prompts
```
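
`gpt4o_extract` above is left undefined; a minimal sketch, assuming JSON-mode labeling (outputs are hand-verified per the recipe):

```python
import json
import openai

def gpt4o_extract(prompt: str) -> dict:
    """Label one extraction prompt with GPT-4o and return the ground-truth JSON."""
    result = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        response_format={"type": "json_object"},  # force a parseable JSON label
    )
    return json.loads(result.choices[0].message.content)
```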

**Also add 30% general reasoning data** (Cocktail Effect paper finding):
- Source: Portuguese subset of Orca-Math or translated OpenOrca
- Purpose: Regularization β€” prevents model from overfitting to domain patterns
- Mix ratio: sample 700 of the 1000 domain prompts per epoch + 300 general = 1000 total (a mixing sketch follows)
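
A minimal sketch of that sampling, assuming the domain and general prompts are plain Python lists (the function name and seed are illustrative):

```python
import random

def mix_training_data(domain_prompts, general_prompts,
                      n_domain=700, n_general=300, seed=42):
    """Sample a 70/30 domain/general mix and shuffle (Cocktail Effect-style regularization)."""
    rng = random.Random(seed)
    mixed = rng.sample(domain_prompts, n_domain) + rng.sample(general_prompts, n_general)
    rng.shuffle(mixed)
    return mixed
```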

### Step 3.2: Configuration Changes for v3

```python
# === GRPO v3 Config Changes ===

# FIX 1: Temperature β€” prevent entropy collapse
TEMPERATURE = 1.0  # Was 0.8 in v2. All published GRPO papers use 1.0.
# Reference: Skywork-OR1 (2505.22312) ablation shows Ο„=1.0 >> Ο„=0.6

# FIX 2: Completion length β€” remove the ceiling
MAX_COMPLETION_LENGTH = 4096  # Was 2048. Every v2 completion hit the ceiling.
# Trade-off: halve num_generations to fit VRAM

# FIX 3: Reduce generations to fit longer completions
NUM_GENERATIONS = 4  # Was 8. 4 Γ— 4096 β‰ˆ 8 Γ— 2048 in VRAM terms.
# MC-GRPO paper shows G=4 can work if using median baseline

# FIX 4: Explicit Ξ²=0 (no KL penalty)
# Dr. GRPO paper: Ξ²=0 is optimal for rule-based rewards
BETA = 0.0

# FIX 5: Learning rate β€” slightly more aggressive
LEARNING_RATE = 3e-6  # Was 2e-6. Clip ratios were all 0 β†’ room to push harder.
# Still well within published range (1e-6 to 5e-6)

# FIX 6: Max steps for expanded data
# 1000 prompts Γ— 4 generations / (4 batch Γ— 2 accum) = 500 steps/epoch
MAX_STEPS = 500

# FIX 7: Early stopping β€” more generous
EARLY_STOPPING_PATIENCE = 15  # 15 evals Γ— 10 steps = 150 steps of runway
EVAL_STEPS = 10
SAVE_STEPS = 10
```
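
For concreteness, these constants could wire into TRL's `GRPOConfig` roughly as follows (a sketch; field names follow TRL's GRPO trainer, batch sizes follow the FIX 6 step math, and `output_dir` is illustrative):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs/grpo-v3",
    learning_rate=LEARNING_RATE,
    max_steps=MAX_STEPS,
    per_device_train_batch_size=4,        # 4 batch Γ— 2 accum, per FIX 6
    gradient_accumulation_steps=2,
    num_generations=NUM_GENERATIONS,
    max_completion_length=MAX_COMPLETION_LENGTH,
    temperature=TEMPERATURE,
    beta=BETA,                             # FIX 4: no KL penalty
    scale_rewards=False,                   # Dr. GRPO: skip std normalization (see appendix)
    eval_strategy="steps",
    eval_steps=EVAL_STEPS,
    save_steps=SAVE_STEPS,
)
```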

### Step 3.3: Add Entropy Monitoring (Critical for v3)

Since TRL 0.24.0 doesn't have native entropy bonus, implement via callback:

```python
import wandb
from transformers import TrainerCallback

class EntropyMonitorCallback(TrainerCallback):
    """Monitor policy entropy to detect collapse early."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and "train/completion_length" in logs:
            # Proxy for entropy: if all completions hit max length,
            # entropy is collapsing (the model is stuck in one mode)
            completion_ratio = logs["train/completion_length"] / MAX_COMPLETION_LENGTH

            if completion_ratio > 0.95:
                print(f"⚠️ Step {state.global_step}: Completion ratio {completion_ratio:.2f} β€” "
                      f"possible entropy collapse. Monitor reward_std.")

            # Log to W&B
            if wandb.run:
                wandb.log({
                    "monitor/completion_ratio": completion_ratio,
                    "monitor/entropy_proxy": 1.0 - completion_ratio,
                }, step=state.global_step)
```

### Step 3.4: Add Zero-Advantage Group Filtering

```python
import random

# In the reward function wrapper or custom trainer:
def filtered_commerce_reward_fn(completions, prompts, **kwargs):
    """Compute rewards and flag zero-variance groups for filtering."""
    rewards = commerce_reward_fn(completions, prompts, **kwargs)

    # Group rewards by prompt (each prompt has NUM_GENERATIONS completions)
    for i in range(0, len(rewards), NUM_GENERATIONS):
        group = rewards[i:i+NUM_GENERATIONS]
        if max(group) - min(group) < 0.01:  # Near-zero variance
            # Add small noise to break the tie β€” this prevents the
            # GRPO advantage denominator from exploding
            for j in range(i, i+NUM_GENERATIONS):
                rewards[j] += random.gauss(0, 0.005)

    return rewards
```

**Note:** The proper fix is MC-GRPO's median baseline, but that requires modifying TRL internals. The noise injection is a pragmatic workaround for TRL 0.24.0.
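
For reference, the median baseline amounts to a one-line change in how group advantages are computed; a conceptual sketch only, not wired into TRL:

```python
import statistics

def median_baseline_advantages(group_rewards: list[float]) -> list[float]:
    """Advantage = reward minus the group *median* instead of the mean.
    The median is robust to a single outlier completion at small G (e.g. G=4)."""
    baseline = statistics.median(group_rewards)
    return [r - baseline for r in group_rewards]
```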

### Step 3.5: Reward Function Refinement

Split the composite reward into staged components (Reasoning-SQL paper finding):

```python
def commerce_reward_fn_v3(completions, prompts, **kwargs):
    """Multi-component reward with staged convergence."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Stage 1: Format reward (converges first)
        r_format = score_format(completion)  # JSON valid? Think tags closed? Right structure?
        
        # Stage 2: Partial content reward
        r_partial = score_partial_content(completion, prompt)  # Some fields correct?
        
        # Stage 3: Full task reward  
        r_task = score_full_task(completion, prompt)  # All fields correct? SQL executes?
        
        # Weighted combination (weights fixed here; the format weight can be annealed toward zero as training progresses)
        reward = 0.2 * r_format + 0.3 * r_partial + 0.5 * r_task
        rewards.append(reward)
    
    return rewards
```
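
`score_format` and the other staged scorers are project-specific; as an illustration, a minimal format scorer might check only structure (the `<think>` tag convention and the half-point weights here are assumptions):

```python
import json
import re

def score_format(completion: str) -> float:
    """Structural checks only: balanced think tags, parseable JSON block."""
    score = 0.0
    # Reasoning-trace tags must be closed
    if completion.count("<think>") == completion.count("</think>"):
        score += 0.5
    # A JSON object must be present and parseable
    match = re.search(r"\{.*\}", completion, re.DOTALL)
    if match:
        try:
            json.loads(match.group(0))
            score += 0.5
        except json.JSONDecodeError:
            pass
    return score
```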

### Step 3.6: VRAM Budget for v3

```
L4 GPU: 24GB total

Model (NF4):          ~3.5GB
KV Cache (4096 seq):  ~2.0GB
Activations:          ~4.0GB
Optimizer states:     ~3.0GB
Generations (4Γ—4096): ~8.0GB
─────────────────────────────
Estimated total:      ~20.5GB
Headroom:             ~3.5GB

βœ… Should fit. If OOM: reduce MAX_COMPLETION_LENGTH to 3072 first.
```
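
The numbers above are estimates; to check them against reality during the first training steps, `torch.cuda` can report actual usage:

```python
import torch

def log_vram(tag: str):
    """Print allocated/reserved VRAM in GB for the current CUDA device."""
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f"[{tag}] allocated: {allocated:.1f} GB | reserved: {reserved:.1f} GB")
```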

### Step 3.7: Training Execution

```bash
# Pre-flight checklist:
# βœ… Benchmark built and baselines recorded (Phase 1)
# βœ… 1000+ prompts prepared and validated
# βœ… Config changes applied (temperature, completion length, generations, LR)
# βœ… Entropy monitor callback added
# βœ… Zero-advantage filtering active
# βœ… Reward function v3 with staged components
# βœ… FRESH=True (nuke old checkpoints)
# βœ… W&B run name: grpo-v3-l4-{timestamp}

# Expected runtime: 500 steps Γ— ~5 min/step (longer completions) β‰ˆ 42 hours
# Checkpoints every 10 steps
# Early stopping patience: 15 evals (150 steps)
```

### Step 3.8: Post-Training Validation

1. Run Phase 1 benchmark on GRPO v3 best checkpoint
2. Compare against all baselines:
   - Qwen3-3.7B base β†’ Tucano2-SFT β†’ Tucano2-GRPO-v2 β†’ **Tucano2-GRPO-v3**
3. If v3 > v2 on benchmark: save as production model
4. If v3 β‰ˆ v2: entropy collapse not fully fixed; consider switching to MC-GRPO or upgrading GPU for longer completions
5. If v3 < v2: rollback, investigate β€” likely reward function regression

---

## Decision Criteria for Stopping

| Condition | Action |
|-----------|--------|
| v3 eval reward > 0.20 AND extraction score > 0.40 | Ship it β€” significantly better than v2 |
| v3 eval reward 0.15-0.20, improving trend | Run epoch 2 (extend MAX_STEPS to 1000) |
| v3 eval reward < v2 (0.125) | Stop, diagnose, review reward function and data |
| Entropy collapse again (clip_ratio=0 after step 50) | Add entropy bonus via custom loss (requires TRL fork) |
| OOM | Reduce MAX_COMPLETION_LENGTH to 3072 β†’ 2560 β†’ 2048 |

---

## Appendix: Literature References for Each Fix

| Fix | Paper | Section | Key Finding |
|-----|-------|---------|-------------|
| Temperature=1.0 | Skywork-OR1 (2505.22312) | Β§4, Table 3 | Ο„=1.0 gives 5-8% better performance, delays entropy collapse |
| Ξ²=0 (no KL) | Dr. GRPO (2503.20783) | Β§3.2 | KL penalty unnecessary for rule-based rewards |
| scale_rewards=False | Dr. GRPO (2503.20783) | Β§3.1 | Std normalization biases toward low-variance groups |
| Longer completions | Dr. GRPO (2503.20783) | Β§3.1 | GRPO length bias inflates wrong answers β†’ ceiling hit |
| Zero-advantage filtering | Skywork-OR1 (2505.22312) | Β§3.1 | Zero-std groups destabilize training |
| Staged rewards | Reasoning-SQL (2503.23157) | Β§3.2 | Format rewards converge first, enable task learning |
| General data mixing | Cocktail Effect (2410.01109) | Β§4 | 30% general data improves domain performance 2-15% |
| G=4 with median | MC-GRPO (2601.22582) | Β§3 | Median baseline reduces noise at small group sizes |