# V3 Patch: 3 Changes for Task-Aware Thinking Control
## Overview
These 3 changes go into the v3 notebook. Each change is a precise cell modification.
**Research basis:**
- OptimalThinkingBench (2508.13141): "Don't overthink" → -23% tokens, +7.7pp accuracy on Qwen3
- Mid-Think (2601.07036): task-specific thinking control in GRPO → +2.6pp AIME, -15% train time
- L1 (2503.04697): token budgets in prompts work when trained with RL reward signal
- User's proven extraction prompt: XML-tagged structure + few-shot + schema enforcement
---
## CHANGE 1: Replace SYSTEM_PT with task-aware system prompts (Cell 3)
### REMOVE this block from Cell 3:
```python
SYSTEM_PT = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro."
)
```
### REPLACE with:
````python
# ══════════════════════════════════════════════════════════════════════════════
# v3: TASK-AWARE SYSTEM PROMPTS
# ══════════════════════════════════════════════════════════════════════════════
# Research basis:
#   - OptimalThinkingBench (2508.13141): "Don't overthink" → -23% tokens, +7.7pp accuracy on Qwen3
#   - Mid-Think (2601.07036): task-specific thinking control in GRPO → +2.6pp AIME, -15% train time
#   - L1 (2503.04697): token budgets in prompts work when trained with RL reward signal
#   - User's proven extraction prompt: XML-tagged structure + few-shot + schema enforcement

COMPLAINT_CATEGORIES_STR = ", ".join(sorted(VALID_CATEGORIES))

SYSTEM_EXTRACTION = (
    "Você é um motor de extração de dados de e-commerce brasileiro. "
    "Retorne APENAS um objeto JSON válido, sem nenhum texto antes ou depois. "
    "NÃO USE blocos de código markdown (```json). "
    "O primeiro caractere da sua resposta deve ser { e o último deve ser }. "
    "Campos não mencionados na avaliação devem ser null — nunca invente valores. "
    "Sem explicação. Sem comentários. Não pense em excesso."
)

SYSTEM_SQL = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\n\n"
    "Para consultas e análises de dados: pense brevemente sobre a estrutura necessária, "
    "depois apresente a resposta de forma direta com números e dados concretos. "
    "Seja conciso no raciocínio. Não pense em excesso."
)

SYSTEM_INSIGHTS = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\n\n"
    "Para análises estratégicas: raciocine de forma estruturada e concisa, "
    "focando nos pontos principais e recomendações acionáveis. "
    "Use no máximo 500 tokens para raciocinar antes de responder."
)

SYSTEM_PUSH = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\n\n"
    "Para notificações push: seja direto e criativo. "
    "A notificação deve ter no máximo 120 caracteres. "
    "Responda diretamente sem pensar em excesso."
)

# Legacy fallback — used only in cells that don't have task context
SYSTEM_PT = (
    "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
    "Você compreende avaliações de clientes em português e padrões de comércio brasileiro."
)

def get_system_prompt(task_type: str) -> str:
    """Return the task-optimized system prompt, falling back to SYSTEM_PT."""
    return {
        "extraction": SYSTEM_EXTRACTION,
        "sql_qa": SYSTEM_SQL,
        "insights": SYSTEM_INSIGHTS,
        "push": SYSTEM_PUSH,
    }.get(task_type, SYSTEM_PT)

# ── Think token budgets per task (for the reward function) ────────────────────
# These are soft targets — the reward function nudges, it does not enforce.
THINK_BUDGETS = {
    "extraction": 150,  # Extraction barely needs thinking — pattern matching
    "push": 100,        # Push is creative writing, not reasoning
    "sql_qa": 400,      # SQL benefits from brief query planning
    "insights": 800,    # Insights need structured multi-step analysis
}

print("✓ v3 Task-aware system prompts defined")
print(f"  extraction: '{SYSTEM_EXTRACTION[:60]}...'")
print(f"  sql_qa:     '{SYSTEM_SQL[:60]}...'")
print(f"  insights:   '{SYSTEM_INSIGHTS[:60]}...'")
print(f"  push:       '{SYSTEM_PUSH[:60]}...'")
````
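The dispatch and fallback behavior can be exercised quickly. In the sketch below, short stub strings stand in for the full Portuguese prompts so it runs standalone:

```python
# Stub values standing in for the Cell 3 prompt strings (illustrative only)
SYSTEM_EXTRACTION = "extraction prompt"
SYSTEM_SQL = "sql prompt"
SYSTEM_INSIGHTS = "insights prompt"
SYSTEM_PUSH = "push prompt"
SYSTEM_PT = "generic fallback prompt"
THINK_BUDGETS = {"extraction": 150, "push": 100, "sql_qa": 400, "insights": 800}

def get_system_prompt(task_type: str) -> str:
    """Same dispatch as Cell 3, shown with stub strings."""
    return {
        "extraction": SYSTEM_EXTRACTION,
        "sql_qa": SYSTEM_SQL,
        "insights": SYSTEM_INSIGHTS,
        "push": SYSTEM_PUSH,
    }.get(task_type, SYSTEM_PT)

# Known tasks get their own prompt; anything else falls back to SYSTEM_PT,
# and THINK_BUDGETS.get(task, 500) supplies the default soft budget.
print(get_system_prompt("push"))               # push prompt
print(get_system_prompt("unknown_task"))       # generic fallback prompt
print(THINK_BUDGETS.get("unknown_task", 500))  # 500
```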
---
## CHANGE 2: Add reward_think_efficiency() to Cell 6 (Reward Functions)
### ADD this function right before `commerce_reward_fn` in Cell 6:
```python
import re  # no-op if re is already imported earlier in the notebook

def reward_think_efficiency(completion: str, task_type: str) -> float:
    """
    Reward concise thinking; penalize bloated <think> blocks.

    v3 NEW — Research basis:
      - OptimalThinkingBench (2508.13141): overthinking hurts accuracy on simple tasks
      - L1 (2503.04697): token-budget rewards teach models to control reasoning length
      - Train Long, Think Short (2508.08940): triangular length reward around a target budget

    Returns: -0.05 to +0.1 (a small component — a nudge, not a dominant term).
    """
    think_match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    budget = THINK_BUDGETS.get(task_type, 500)

    if not think_match:
        # No think block at all
        if task_type in ("extraction", "push"):
            return 0.1  # Great — these tasks don't need thinking
        return 0.0      # Neutral for analytical tasks

    think_content = think_match.group(1).strip()
    think_chars = len(think_content)       # chars as a proxy (cheaper than tokenizing)
    think_tokens_approx = think_chars / 4  # rough conversion: ~4 chars per token for Portuguese

    if think_tokens_approx <= budget:
        # Within budget — full bonus
        return 0.1
    elif think_tokens_approx <= budget * 2:
        # Over budget but not catastrophic — linear decay from 0.1 down to 0.0
        overshoot = (think_tokens_approx - budget) / budget
        return 0.1 * (1.0 - overshoot)
    else:
        # Way over budget — mild penalty
        return -0.05
```
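A minimal self-contained check of the reward shape; the function body and `THINK_BUDGETS` are copied from above so the sketch runs standalone:

```python
import re

THINK_BUDGETS = {"extraction": 150, "push": 100, "sql_qa": 400, "insights": 800}

def reward_think_efficiency(completion: str, task_type: str) -> float:
    think_match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    budget = THINK_BUDGETS.get(task_type, 500)
    if not think_match:
        return 0.1 if task_type in ("extraction", "push") else 0.0
    think_tokens_approx = len(think_match.group(1).strip()) / 4
    if think_tokens_approx <= budget:
        return 0.1
    if think_tokens_approx <= budget * 2:
        overshoot = (think_tokens_approx - budget) / budget
        return 0.1 * (1.0 - overshoot)
    return -0.05

# No think block: bonus for extraction/push, neutral for analytical tasks
print(reward_think_efficiency('{"ok": 1}', "extraction"))  # 0.1
print(reward_think_efficiency("analysis", "insights"))     # 0.0
# Push budget is 100 tokens ≈ 400 chars: within budget, halfway overshoot, blown
print(reward_think_efficiency("<think>" + "x" * 300 + "</think>ok", "push"))   # 0.1
print(reward_think_efficiency("<think>" + "x" * 600 + "</think>ok", "push"))   # 0.05
print(reward_think_efficiency("<think>" + "x" * 2000 + "</think>ok", "push"))  # -0.05
```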
### MODIFY `commerce_reward_fn` dispatch block:
**Current code (REMOVE):**
```python
if task == "extraction":
    rewards.append(reward_extraction(comp_text))
elif task == "sql_qa":
    rewards.append(reward_sql_qa(comp_text))
elif task == "insights":
    rewards.append(reward_insights(comp_text))
elif task == "push":
    rewards.append(reward_push(comp_text))
else:
    r = 0.15 if has_think_block(comp_text) else 0.0
    r += 0.2 if comp_text.strip() else 0.0
    rewards.append(r)
```
**New code (REPLACE WITH):**
```python
if task == "extraction":
    task_r = reward_extraction(comp_text)
elif task == "sql_qa":
    task_r = reward_sql_qa(comp_text)
elif task == "insights":
    task_r = reward_insights(comp_text)
elif task == "push":
    task_r = reward_push(comp_text)
else:
    task_r = 0.15 if has_think_block(comp_text) else 0.0
    task_r += 0.2 if comp_text.strip() else 0.0

# v3: Think efficiency bonus/penalty (small weight — nudge, not dominate)
think_r = reward_think_efficiency(comp_text, task)
rewards.append(task_r + think_r)
```
---
## CHANGE 3: Wire system prompts into data preparation and eval
### Cell 7 (Calibration) — add helper + use in loop:
Add this helper function after loading `by_type`:
```python
def inject_task_system_prompt(msgs, task_type):
    """Replace the generic system prompt with the task-specific one."""
    new_msgs = []
    system_prompt = get_system_prompt(task_type)
    has_system = False
    for m in msgs:
        if m["role"] == "system":
            new_msgs.append({"role": "system", "content": system_prompt})
            has_system = True
        else:
            new_msgs.append(m)
    if not has_system:
        new_msgs.insert(0, {"role": "system", "content": system_prompt})
    return new_msgs
```
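A quick sanity check of the helper's two paths (replace an existing system turn, or prepend one). The helper body is copied from above, and a stub `get_system_prompt` stands in for the Cell 3 version:

```python
def get_system_prompt(task_type):
    """Stub standing in for the Cell 3 version (illustrative only)."""
    return f"[{task_type} prompt]"

def inject_task_system_prompt(msgs, task_type):
    """Replace the generic system prompt with the task-specific one."""
    new_msgs = []
    system_prompt = get_system_prompt(task_type)
    has_system = False
    for m in msgs:
        if m["role"] == "system":
            new_msgs.append({"role": "system", "content": system_prompt})
            has_system = True
        else:
            new_msgs.append(m)
    if not has_system:
        new_msgs.insert(0, {"role": "system", "content": system_prompt})
    return new_msgs

# Path 1: an existing system turn is replaced in place
out = inject_task_system_prompt(
    [{"role": "system", "content": "old"}, {"role": "user", "content": "oi"}], "push")
print(out[0]["content"])  # [push prompt]
# Path 2: a missing system turn is prepended
out = inject_task_system_prompt([{"role": "user", "content": "oi"}], "sql_qa")
print(out[0]["role"], out[0]["content"])  # system [sql_qa prompt]
```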
Then in the calibration loop, inject the task-aware prompt before template application:
```python
for i, msgs in enumerate(cal_samples):
    # Determine task type from user content
    user_text = " ".join(m["content"] for m in msgs if m["role"] == "user")
    task = _classify_task_type(user_text)

    # v3: Inject task-aware system prompt
    msgs = inject_task_system_prompt(msgs, task)

    text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    # ... rest of loop unchanged
```
### Cell 8 (Dataset Preparation) — inject into train/eval records:
In `prepare_grpo_datasets_v3`, after building train_records and eval_records (before creating HF Datasets), add:
```python
# v3: Inject task-aware system prompts into each training record
for i, record in enumerate(train_records):
    user_text = " ".join(m["content"] for m in record if m["role"] == "user")
    task = _classify_task_type(user_text)
    train_records[i] = inject_task_system_prompt(record, task)

# Same for the eval records
for i, record in enumerate(eval_records):
    user_text = " ".join(m["content"] for m in record if m["role"] == "user")
    task = _classify_task_type(user_text)
    eval_records[i] = inject_task_system_prompt(record, task)

print("  ✓ Task-aware system prompts injected")
```
### Cell 11 (EvalRewardCallback) — no change needed:
System prompts were injected in Cell 8, so eval data already has the right prompts.
### Cell 13 (Validation) — use task-aware selection:
Replace:
```python
system_msg = {"role": "system", "content": SYSTEM_PT}
```
With task-aware selection inside the loop:
```python
# REMOVE the fixed system_msg line above the loop
# Inside the loop, before generating:
task = _classify_task_type(prompt["content"])
system_msg = {"role": "system", "content": get_system_prompt(task)}
messages = [system_msg, prompt]
```
---
## Summary
| Cell | What changes | Lines affected |
|------|-------------|---------------|
| Cell 3 | Replace `SYSTEM_PT` with 4 task prompts + `get_system_prompt()` + `THINK_BUDGETS` | ~50 lines added |
| Cell 6 | Add `reward_think_efficiency()`, modify `commerce_reward_fn` dispatch | ~35 lines added, ~10 modified |
| Cell 7 | Add `inject_task_system_prompt()`, use in calibration loop | ~15 lines added |
| Cell 8 | Inject task-aware system prompts into train/eval records | ~10 lines added |
| Cell 13 | Use `get_system_prompt(task)` instead of fixed `SYSTEM_PT` | ~3 lines modified |
## Expected impact
| Task | Current think tokens | Expected after patch | Mechanism |
|------|---------------------|---------------------|-----------|
| Extraction | 2000-3000 (100% at ceiling) | ~300-800 (60-70% fewer) | "Não pense em excesso" + think penalty reward |
| Push | 1000-2000 | ~100-300 (70-80% fewer) | "Responda diretamente" + think penalty reward |
| SQL Q&A | 1500-2500 | ~400-800 (~50% fewer) | "Seja conciso no raciocínio" + think budget reward |
| Insights | 2000-3200 (at ceiling) | ~800-1500 (30-40% fewer) | "Use no máximo 500 tokens" + higher think budget |