# V3 Patch: 3 Changes for Task-Aware Thinking Control

## Overview
These three changes go into the v3 notebook. Each is a precise, self-contained cell modification.

**Research basis:**
- OptimalThinkingBench (2508.13141): "Don't overthink" → -23% tokens, +7.7pp accuracy on Qwen3
- Mid-Think (2601.07036): task-specific thinking control in GRPO → +2.6pp AIME, -15% train time
- L1 (2503.04697): token budgets in prompts work when trained with RL reward signal
- User's proven extraction prompt: XML-tagged structure + few-shot + schema enforcement

---

## CHANGE 1: Replace SYSTEM_PT with task-aware system prompts (Cell 3)

### REMOVE this block from Cell 3:
```python
SYSTEM_PT = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro."
)
```

### REPLACE with:
```python
# ══════════════════════════════════════════════════════════════════════════════
# v3: TASK-AWARE SYSTEM PROMPTS
# ══════════════════════════════════════════════════════════════════════════════
# Research basis:
#   - OptimalThinkingBench (2508.13141): "Don't overthink" → -23% tokens, +7.7pp accuracy on Qwen3
#   - Mid-Think (2601.07036): task-specific thinking control in GRPO → +2.6pp AIME, -15% train time
#   - L1 (2503.04697): token budgets in prompts work when trained with RL reward signal
#   - User's proven extraction prompt: XML-tagged structure + few-shot + schema enforcement

COMPLAINT_CATEGORIES_STR = ", ".join(sorted(VALID_CATEGORIES))

SYSTEM_EXTRACTION = (
    "Você é um motor de extração de dados de e-commerce brasileiro. "
    "Retorne APENAS um objeto JSON válido, sem nenhum texto antes ou depois. "
    "NÃO USE blocos de código markdown (` `` json). "
    "O primeiro caractere da sua resposta deve ser { e o último deve ser }. "
    "Campos não mencionados na avaliação devem ser null — nunca invente valores. "
    "Sem explicação. Sem comentários. Não pense em excesso."
)

SYSTEM_SQL = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\n\n"
    "Para consultas e análises de dados: pense brevemente sobre a estrutura necessária, "
    "depois apresente a resposta de forma direta com números e dados concretos. "
    "Seja conciso no raciocínio. Não pense em excesso."
)

SYSTEM_INSIGHTS = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\n\n"
    "Para análises estratégicas: raciocine de forma estruturada e concisa, "
    "focando nos pontos principais e recomendações acionáveis. "
    "Use no máximo 500 tokens para raciocinar antes de responder."
)

SYSTEM_PUSH = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\n\n"
    "Para notificações push: seja direto e criativo. "
    "A notificação deve ter no máximo 120 caracteres. "
    "Responda diretamente sem pensar em excesso."
)

# Legacy fallback — used only in cells that don't have task context
SYSTEM_PT = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro."
)

def get_system_prompt(task_type: str) -> str:
    """Return task-optimized system prompt."""
    return {
        "extraction": SYSTEM_EXTRACTION,
        "sql_qa": SYSTEM_SQL,
        "insights": SYSTEM_INSIGHTS,
        "push": SYSTEM_PUSH,
    }.get(task_type, SYSTEM_PT)

# ── Think token budgets per task (for reward function) ────────────────────────
# These are soft targets — the reward function nudges, not enforces
THINK_BUDGETS = {
    "extraction": 150,   # Extraction barely needs thinking — pattern matching
    "push": 100,         # Push is creative writing, not reasoning
    "sql_qa": 400,       # SQL benefits from brief query planning
    "insights": 800,     # Insights need structured multi-step analysis
}

print("✓ v3 Task-aware system prompts defined")
print(f"  extraction: '{SYSTEM_EXTRACTION[:60]}...'")
print(f"  sql_qa:     '{SYSTEM_SQL[:60]}...'")
print(f"  insights:   '{SYSTEM_INSIGHTS[:60]}...'")
print(f"  push:       '{SYSTEM_PUSH[:60]}...'")
```
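The dispatch in `get_system_prompt` falls back to `SYSTEM_PT` for unknown task keys. A standalone sketch with shortened placeholder strings (hypothetical stand-ins for the full Portuguese prompts above) makes the fallback path concrete:

```python
# Shortened placeholder prompts, standing in for the full prompts defined above
SYSTEM_EXTRACTION = "extraction prompt"
SYSTEM_SQL = "sql prompt"
SYSTEM_INSIGHTS = "insights prompt"
SYSTEM_PUSH = "push prompt"
SYSTEM_PT = "generic fallback prompt"

def get_system_prompt(task_type: str) -> str:
    """Return task-optimized system prompt; unknown keys fall back to SYSTEM_PT."""
    return {
        "extraction": SYSTEM_EXTRACTION,
        "sql_qa": SYSTEM_SQL,
        "insights": SYSTEM_INSIGHTS,
        "push": SYSTEM_PUSH,
    }.get(task_type, SYSTEM_PT)

print(get_system_prompt("push"))          # push prompt
print(get_system_prompt("unknown_task"))  # generic fallback prompt
```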

---

## CHANGE 2: Add reward_think_efficiency() to Cell 6 (Reward Functions)

### ADD this function right before `commerce_reward_fn` in Cell 6:

```python
def reward_think_efficiency(completion: str, task_type: str) -> float:
    """
    Reward concise thinking, penalize bloated <think> blocks.
    
    v3 NEW — Research basis:
      - OptimalThinkingBench (2508.13141): overthinking hurts accuracy on simple tasks
      - L1 (2503.04697): token budget rewards teach models to control reasoning length
      - Train Long Think Short (2508.08940): triangular length reward around target budget
    
    Returns: -0.05 to +0.1 (small component — nudge, not dominate)
    """
    think_match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    budget = THINK_BUDGETS.get(task_type, 500)
    
    if not think_match:
        # No think block at all
        if task_type in ("extraction", "push"):
            return 0.1   # Great — these tasks don't need thinking
        else:
            return 0.0   # Neutral for analytical tasks
    
    think_content = think_match.group(1).strip()
    think_chars = len(think_content)  # chars as proxy (cheaper than tokenizing)
    # Rough conversion: ~4 chars per token for Portuguese
    think_tokens_approx = think_chars / 4
    
    if think_tokens_approx <= budget:
        # Within budget — full flat reward (no extra bonus for being even shorter)
        return 0.1
    elif think_tokens_approx <= budget * 2:
        # Over budget but not catastrophic — linear decay
        overshoot = (think_tokens_approx - budget) / budget
        return 0.1 * (1.0 - overshoot)  # 0.1 → 0.0
    else:
        # Way over budget — mild penalty
        return -0.05
```
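To make the piecewise reward shape concrete, here is a standalone sanity check; the function body and `THINK_BUDGETS` are duplicated from above so the snippet runs in isolation:

```python
import re

# Duplicated from Cell 3 so this snippet runs on its own
THINK_BUDGETS = {"extraction": 150, "push": 100, "sql_qa": 400, "insights": 800}

def reward_think_efficiency(completion: str, task_type: str) -> float:
    think_match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    budget = THINK_BUDGETS.get(task_type, 500)
    if not think_match:
        # No think block: bonus for extraction/push, neutral otherwise
        return 0.1 if task_type in ("extraction", "push") else 0.0
    # ~4 chars per token as a cheap proxy for tokenizing
    think_tokens_approx = len(think_match.group(1).strip()) / 4
    if think_tokens_approx <= budget:
        return 0.1
    if think_tokens_approx <= budget * 2:
        # Linear decay from 0.1 down to 0.0 between 1x and 2x budget
        return 0.1 * (1.0 - (think_tokens_approx - budget) / budget)
    return -0.05

# No think block on extraction: full bonus
print(reward_think_efficiency('{"campo": null}', "extraction"))                      # 0.1
# ~100 approx tokens (400 chars) vs budget 150: within budget
print(reward_think_efficiency("<think>" + "x" * 400 + "</think>{}", "extraction"))   # 0.1
# ~225 approx tokens vs budget 150: halfway through the decay zone
print(reward_think_efficiency("<think>" + "x" * 900 + "</think>{}", "extraction"))   # 0.05
# ~500 approx tokens vs budget 150: more than 2x over, mild penalty
print(reward_think_efficiency("<think>" + "x" * 2000 + "</think>{}", "extraction"))  # -0.05
```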

### MODIFY `commerce_reward_fn` dispatch block:

**Current code (REMOVE):**
```python
        if task == "extraction":
            rewards.append(reward_extraction(comp_text))
        elif task == "sql_qa":
            rewards.append(reward_sql_qa(comp_text))
        elif task == "insights":
            rewards.append(reward_insights(comp_text))
        elif task == "push":
            rewards.append(reward_push(comp_text))
        else:
            r = 0.15 if has_think_block(comp_text) else 0.0
            r += 0.2 if comp_text.strip() else 0.0
            rewards.append(r)
```

**New code (REPLACE WITH):**
```python
        if task == "extraction":
            task_r = reward_extraction(comp_text)
        elif task == "sql_qa":
            task_r = reward_sql_qa(comp_text)
        elif task == "insights":
            task_r = reward_insights(comp_text)
        elif task == "push":
            task_r = reward_push(comp_text)
        else:
            task_r = 0.15 if has_think_block(comp_text) else 0.0
            task_r += 0.2 if comp_text.strip() else 0.0

        # v3: Think efficiency bonus/penalty (small weight — nudge, not dominate)
        think_r = reward_think_efficiency(comp_text, task)
        rewards.append(task_r + think_r)
```

---

## CHANGE 3: Wire system prompts into data preparation and eval

### Cell 7 (Calibration) — add helper + use in loop:

Add this helper function after loading `by_type`:
```python
def inject_task_system_prompt(msgs, task_type):
    """Replace generic system prompt with task-specific one."""
    new_msgs = []
    system_prompt = get_system_prompt(task_type)
    has_system = False
    for m in msgs:
        if m["role"] == "system":
            new_msgs.append({"role": "system", "content": system_prompt})
            has_system = True
        else:
            new_msgs.append(m)
    if not has_system:
        new_msgs.insert(0, {"role": "system", "content": system_prompt})
    return new_msgs
```
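A quick check of the helper's two paths (replace vs. prepend), with a stubbed `get_system_prompt` as a hypothetical stand-in so the snippet runs on its own:

```python
# Hypothetical stub standing in for the real get_system_prompt from Cell 3
def get_system_prompt(task_type: str) -> str:
    return f"<{task_type} system prompt>"

def inject_task_system_prompt(msgs, task_type):
    """Replace generic system prompt with task-specific one."""
    new_msgs = []
    system_prompt = get_system_prompt(task_type)
    has_system = False
    for m in msgs:
        if m["role"] == "system":
            new_msgs.append({"role": "system", "content": system_prompt})
            has_system = True
        else:
            new_msgs.append(m)
    if not has_system:
        new_msgs.insert(0, {"role": "system", "content": system_prompt})
    return new_msgs

# An existing system message is replaced in place
replaced = inject_task_system_prompt(
    [{"role": "system", "content": "generic"}, {"role": "user", "content": "oi"}],
    "extraction",
)
print(replaced[0]["content"])  # <extraction system prompt>

# A missing system message is prepended
prepended = inject_task_system_prompt([{"role": "user", "content": "oi"}], "push")
print([m["role"] for m in prepended])  # ['system', 'user']
```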

Then in the calibration loop, inject the task-aware prompt before template application:
```python
for i, msgs in enumerate(cal_samples):
    # Determine task type from user content
    user_text = " ".join(m["content"] for m in msgs if m["role"] == "user")
    task = _classify_task_type(user_text)
    
    # v3: Inject task-aware system prompt
    msgs = inject_task_system_prompt(msgs, task)
    
    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    # ... rest of loop unchanged
```

### Cell 8 (Dataset Preparation) — inject into train/eval records:

In `prepare_grpo_datasets_v3`, after building `train_records` and `eval_records` (before creating HF Datasets), add:
```python
    # v3: Inject task-aware system prompts into each training record
    for i, record in enumerate(train_records):
        user_text = " ".join(m["content"] for m in record if m["role"] == "user")
        task = _classify_task_type(user_text)
        train_records[i] = inject_task_system_prompt(record, task)
    
    # Same for eval records
    for i, record in enumerate(eval_records):
        user_text = " ".join(m["content"] for m in record if m["role"] == "user")
        task = _classify_task_type(user_text)
        eval_records[i] = inject_task_system_prompt(record, task)
    
    print(f"  ✓ Task-aware system prompts injected")
```

### Cell 11 (EvalRewardCallback) — no change needed:
System prompts were injected in Cell 8, so eval data already has the right prompts.

### Cell 13 (Validation) — use task-aware selection:

Replace:
```python
system_msg = {"role": "system", "content": SYSTEM_PT}
```

With task-aware selection inside the loop:
```python
# REMOVE the fixed system_msg line above the loop

# Inside the loop, before generating:
    task = _classify_task_type(prompt["content"])
    system_msg = {"role": "system", "content": get_system_prompt(task)}
    messages = [system_msg, prompt]
```

---

## Summary

| Cell | What changes | Lines affected |
|------|-------------|---------------|
| Cell 3 | Replace `SYSTEM_PT` with 4 task prompts + `get_system_prompt()` + `THINK_BUDGETS` | ~50 lines added |
| Cell 6 | Add `reward_think_efficiency()`, modify `commerce_reward_fn` dispatch | ~35 lines added, ~10 modified |
| Cell 7 | Add `inject_task_system_prompt()`, use in calibration loop | ~15 lines added |
| Cell 8 | Inject task-aware system prompts into train/eval records | ~10 lines added |
| Cell 13 | Use `get_system_prompt(task)` instead of fixed `SYSTEM_PT` | ~3 lines modified |

## Expected impact

| Task | Current think tokens | Expected after patch | Mechanism |
|------|---------------------|---------------------|-----------|
| Extraction | 2000-3000 (100% hit ceiling) | ~300-800 (-60% to -70%) | "Não pense em excesso" + think penalty reward |
| Push | 1000-2000 | ~100-300 (-70% to -80%) | "Responda diretamente" + think penalty reward |
| SQL Q&A | 1500-2500 | ~400-800 (-50%) | "Seja conciso no raciocínio" + think budget reward |
| Insights | 2000-3200 (hits ceiling) | ~800-1500 (-30% to -40%) | "Use no máximo 500 tokens" + higher think budget |