rtferraz committed · Commit 0f39df7 · verified · 1 Parent(s): fa4a874

Add v3 thinking control patch - task-aware system prompts + think efficiency reward

Files changed (1)
  1. docs/v3_thinking_control_patch.md +279 -0
docs/v3_thinking_control_patch.md ADDED
@@ -0,0 +1,279 @@
+ # V3 Patch: 3 Changes for Task-Aware Thinking Control
+
+ ## Overview
+ These 3 changes go into the v3 notebook. Each change is a precise cell modification.
+
+ **Research basis:**
+ - OptimalThinkingBench (2508.13141): "Don't overthink" → -23% tokens, +7.7pp accuracy on Qwen3
+ - Mid-Think (2601.07036): task-specific thinking control in GRPO → +2.6pp AIME, -15% train time
+ - L1 (2503.04697): token budgets in prompts work when trained with RL reward signal
+ - User's proven extraction prompt: XML-tagged structure + few-shot + schema enforcement
+
+ ---
+
+ ## CHANGE 1: Replace SYSTEM_PT with task-aware system prompts (Cell 3)
+
+ ### REMOVE this block from Cell 3:
+ ```python
+ SYSTEM_PT = (
+     "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
+     "Você compreende avaliações de clientes em português e padrões de comércio brasileiro."
+ )
+ ```
+
+ ### REPLACE with:
+ ```python
+ # ══════════════════════════════════════════════════════════════════════════════
+ # v3: TASK-AWARE SYSTEM PROMPTS
+ # ══════════════════════════════════════════════════════════════════════════════
+ # Research basis:
+ # - OptimalThinkingBench (2508.13141): "Don't overthink" → -23% tokens, +7.7pp accuracy on Qwen3
+ # - Mid-Think (2601.07036): task-specific thinking control in GRPO → +2.6pp AIME, -15% train time
+ # - L1 (2503.04697): token budgets in prompts work when trained with RL reward signal
+ # - User's proven extraction prompt: XML-tagged structure + few-shot + schema enforcement
+
+ # VALID_CATEGORIES is not defined in this patch; it must already exist in the notebook.
+ COMPLAINT_CATEGORIES_STR = ", ".join(sorted(VALID_CATEGORIES))
+
+ # Extraction: JSON only, no markdown fences, null for unmentioned fields, don't overthink.
+ SYSTEM_EXTRACTION = (
+     "Você é um motor de extração de dados de e-commerce brasileiro. "
+     "Retorne APENAS um objeto JSON válido, sem nenhum texto antes ou depois. "
+     "NÃO USE blocos de código markdown (```json). "
+     "O primeiro caractere da sua resposta deve ser { e o último deve ser }. "
+     "Campos não mencionados na avaliação devem ser null - nunca invente valores. "
+     "Sem explicação. Sem comentários. Não pense em excesso."
+ )
+
+ # SQL Q&A: think briefly about the required structure, then answer with concrete numbers.
+ SYSTEM_SQL = (
+     "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
+     "Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\n\n"
+     "Para consultas e análises de dados: pense brevemente sobre a estrutura necessária, "
+     "depois apresente a resposta de forma direta com números e dados concretos. "
+     "Seja conciso no raciocínio. Não pense em excesso."
+ )
+
+ # Insights: structured, concise reasoning; at most 500 reasoning tokens before answering.
+ SYSTEM_INSIGHTS = (
+     "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
+     "Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\n\n"
+     "Para análises estratégicas: raciocine de forma estruturada e concisa, "
+     "focando nos pontos principais e recomendações acionáveis. "
+     "Use no máximo 500 tokens para raciocinar antes de responder."
+ )
+
+ # Push: direct and creative, at most 120 characters, answer without overthinking.
+ SYSTEM_PUSH = (
+     "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
+     "Você compreende avaliações de clientes em português e padrões de comércio brasileiro.\n\n"
+     "Para notificações push: seja direto e criativo. "
+     "A notificação deve ter no máximo 120 caracteres. "
+     "Responda diretamente sem pensar em excesso."
+ )
+
+ # Legacy fallback: used only in cells that don't have task context
+ SYSTEM_PT = (
+     "Você é um assistente de IA especializado em análise de e-commerce brasileiro. "
+     "Você compreende avaliações de clientes em português e padrões de comércio brasileiro."
+ )
+
+ def get_system_prompt(task_type: str) -> str:
+     """Return task-optimized system prompt (SYSTEM_PT for unknown task types)."""
+     return {
+         "extraction": SYSTEM_EXTRACTION,
+         "sql_qa": SYSTEM_SQL,
+         "insights": SYSTEM_INSIGHTS,
+         "push": SYSTEM_PUSH,
+     }.get(task_type, SYSTEM_PT)
+
+ # ── Think token budgets per task (for reward function) ────────────────────────
+ # These are soft targets: the reward function nudges, not enforces
+ THINK_BUDGETS = {
+     "extraction": 150,  # Extraction barely needs thinking: pattern matching
+     "push": 100,        # Push is creative writing, not reasoning
+     "sql_qa": 400,      # SQL benefits from brief query planning
+     "insights": 800,    # Multi-step analysis (note: SYSTEM_INSIGHTS asks for at most 500)
+ }
+
+ print("✓ v3 Task-aware system prompts defined")
+ print(f"  extraction: '{SYSTEM_EXTRACTION[:60]}...'")
+ print(f"  sql_qa: '{SYSTEM_SQL[:60]}...'")
+ print(f"  insights: '{SYSTEM_INSIGHTS[:60]}...'")
+ print(f"  push: '{SYSTEM_PUSH[:60]}...'")
+ ```
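
The prompt lookup relies on `dict.get` with a default, so any task type outside the four known ones silently receives the legacy prompt. The sketch below illustrates that fallback and adds a small consistency check between prompts and budgets. The placeholder prompt strings and the `"chitchat"` task name are assumptions for illustration; the budget values are the ones from `THINK_BUDGETS`.

```python
# Placeholder prompts: stand-ins for the real SYSTEM_* constants (assumption).
PROMPTS = {
    "extraction": "extraction prompt",
    "sql_qa": "sql prompt",
    "insights": "insights prompt",
    "push": "push prompt",
}
FALLBACK = "legacy prompt"

# Budget values copied from THINK_BUDGETS in the patch.
THINK_BUDGETS = {"extraction": 150, "push": 100, "sql_qa": 400, "insights": 800}


def get_system_prompt(task_type: str) -> str:
    """Return the task prompt, or the legacy fallback for unknown task types."""
    return PROMPTS.get(task_type, FALLBACK)


known = get_system_prompt("push")
unknown = get_system_prompt("chitchat")  # hypothetical, unlisted task type

# Every task that has a prompt should also have a think budget.
missing_budgets = set(PROMPTS) - set(THINK_BUDGETS)

print(known, "|", unknown, "|", missing_budgets)
```

A check like `missing_budgets` may be worth keeping in the notebook, since a task added to one dictionary but not the other would otherwise fall through to a default without any warning.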
+
+ ---
+
+ ## CHANGE 2: Add reward_think_efficiency() to Cell 6 (Reward Functions)
+
+ ### ADD this function right before `commerce_reward_fn` in Cell 6:
+
+ ```python
+ import re  # needed for re.search below, if Cell 6 does not already import it
+
+ def reward_think_efficiency(completion: str, task_type: str) -> float:
+     """
+     Reward concise thinking, penalize bloated <think> blocks.
+
+     v3 NEW - Research basis:
+     - OptimalThinkingBench (2508.13141): overthinking hurts accuracy on simple tasks
+     - L1 (2503.04697): token budget rewards teach models to control reasoning length
+     - Train Long Think Short (2508.08940): triangular length reward around target budget
+
+     Returns: -0.05 to +0.1 (small component: nudge, not dominate)
+     """
+     think_match = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
+     budget = THINK_BUDGETS.get(task_type, 500)
+
+     if not think_match:
+         # No think block at all
+         if task_type in ("extraction", "push"):
+             return 0.1  # Great: these tasks don't need thinking
+         else:
+             return 0.0  # Neutral for analytical tasks
+
+     think_content = think_match.group(1).strip()
+     think_chars = len(think_content)  # chars as proxy (cheaper than tokenizing)
+     # Rough conversion: ~4 chars per token for Portuguese
+     think_tokens_approx = think_chars / 4
+
+     if think_tokens_approx <= budget:
+         # Within budget: full bonus (flat, not scaled by how far under budget)
+         return 0.1
+     elif think_tokens_approx <= budget * 2:
+         # Over budget but not catastrophic: linear decay
+         overshoot = (think_tokens_approx - budget) / budget
+         return 0.1 * (1.0 - overshoot)  # 0.1 → 0.0
+     else:
+         # Way over budget: mild penalty
+         return -0.05
+ ```
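
To make the shape of the bonus concrete, here is a minimal sketch that isolates the piecewise rule from the regex and the character-to-token conversion. The helper name `think_bonus` is an assumption introduced for illustration; the constants (0.1, -0.05, the 2x cutoff) and the `sql_qa` budget of 400 come from the patch.

```python
def think_bonus(tokens: float, budget: float) -> float:
    """Piecewise bonus: flat inside budget, linear decay up to 2x budget, then a penalty."""
    if tokens <= budget:
        return 0.1
    if tokens <= 2 * budget:
        overshoot = (tokens - budget) / budget  # 0.0 at budget, 1.0 at 2x budget
        return 0.1 * (1.0 - overshoot)
    return -0.05


budget = 400  # the sql_qa budget from THINK_BUDGETS
samples = [200, 400, 600, 800, 801]
bonuses = [round(think_bonus(t, budget), 3) for t in samples]

# Maps each approximate think-token count to its bonus.
print(dict(zip(samples, bonuses)))
```

Two properties are worth noticing. The bonus is flat at 0.1 anywhere inside the budget, so a 50-token block and a 400-token block score the same. And there is a step from 0.0 to -0.05 just past twice the budget, so the function is not continuous at that point.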
+
+ ### MODIFY `commerce_reward_fn` dispatch block:
+
+ **Current code (REMOVE):**
+ ```python
+ if task == "extraction":
+     rewards.append(reward_extraction(comp_text))
+ elif task == "sql_qa":
+     rewards.append(reward_sql_qa(comp_text))
+ elif task == "insights":
+     rewards.append(reward_insights(comp_text))
+ elif task == "push":
+     rewards.append(reward_push(comp_text))
+ else:
+     r = 0.15 if has_think_block(comp_text) else 0.0
+     r += 0.2 if comp_text.strip() else 0.0
+     rewards.append(r)
+ ```
+
+ **New code (REPLACE WITH):**
+ ```python
+ if task == "extraction":
+     task_r = reward_extraction(comp_text)
+ elif task == "sql_qa":
+     task_r = reward_sql_qa(comp_text)
+ elif task == "insights":
+     task_r = reward_insights(comp_text)
+ elif task == "push":
+     task_r = reward_push(comp_text)
+ else:
+     task_r = 0.15 if has_think_block(comp_text) else 0.0
+     task_r += 0.2 if comp_text.strip() else 0.0
+
+ # v3: Think efficiency bonus/penalty (small weight: nudge, not dominate)
+ think_r = reward_think_efficiency(comp_text, task)
+ rewards.append(task_r + think_r)
+ ```
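
The new dispatch computes a task reward first and then adds the think-efficiency term. One possible way to express the same pattern is a lookup table of scorer functions, sketched below. This is an alternative layout, not the patch's code: the `stub_*` scorers are invented stand-ins for `reward_extraction`, the default branch, and `reward_think_efficiency`, and their scoring rules are assumptions chosen only to keep the example runnable.

```python
from typing import Callable, Dict


def stub_extraction(text: str) -> float:
    """Stand-in for reward_extraction (assumption): 1.0 if the output looks like JSON."""
    return 1.0 if text.strip().startswith("{") else 0.0


def stub_default(text: str) -> float:
    """Stand-in for the fallback branch (assumption): small credit for non-empty output."""
    return 0.2 if text.strip() else 0.0


def stub_think_efficiency(text: str, task: str) -> float:
    """Stand-in for reward_think_efficiency (assumption): bonus when no think block."""
    return 0.1 if "<think>" not in text else 0.0


# Table-driven dispatch: one entry per task type, default handled by .get().
TASK_REWARDS: Dict[str, Callable[[str], float]] = {"extraction": stub_extraction}


def combined_reward(text: str, task: str) -> float:
    """Task reward plus the think-efficiency term, mirroring task_r + think_r."""
    task_fn = TASK_REWARDS.get(task, stub_default)
    return task_fn(text) + stub_think_efficiency(text, task)


scores = [
    combined_reward('{"a": null}', "extraction"),
    combined_reward("", "other"),
]
print(scores)
```

With a table like this, adding a task means adding one dictionary entry rather than another `elif` branch, at the cost of requiring all scorers to share one signature.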
+
+ ---
+
+ ## CHANGE 3: Wire system prompts into data preparation and eval
+
+ ### Cell 7 (Calibration) - add helper + use in loop:
+
+ Add this helper function after loading `by_type`:
+ ```python
+ def inject_task_system_prompt(msgs, task_type):
+     """Replace generic system prompt with task-specific one."""
+     new_msgs = []
+     system_prompt = get_system_prompt(task_type)
+     has_system = False
+     for m in msgs:
+         if m["role"] == "system":
+             new_msgs.append({"role": "system", "content": system_prompt})
+             has_system = True
+         else:
+             new_msgs.append(m)
+     if not has_system:
+         new_msgs.insert(0, {"role": "system", "content": system_prompt})
+     return new_msgs
+ ```
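
The helper has two paths: it overwrites an existing system message, or prepends one when the conversation has none. The sketch below exercises both paths with a compact equivalent. The name `inject` and its signature (taking the prompt text directly rather than a task type) are assumptions made so the example runs without the notebook's constants.

```python
from typing import Dict, List

Message = Dict[str, str]


def inject(msgs: List[Message], system_prompt: str) -> List[Message]:
    """Swap every system message for system_prompt; prepend one if none exists."""
    new_msgs = [
        {"role": "system", "content": system_prompt} if m["role"] == "system" else m
        for m in msgs
    ]
    if not any(m["role"] == "system" for m in msgs):
        new_msgs.insert(0, {"role": "system", "content": system_prompt})
    return new_msgs


with_system = [{"role": "system", "content": "old"}, {"role": "user", "content": "hi"}]
without_system = [{"role": "user", "content": "hi"}]

replaced = inject(with_system, "new")
prepended = inject(without_system, "new")

print(replaced)
print(prepended)
```

Both paths return a new list and leave the input list untouched, which matters if the same records are reused elsewhere in the notebook.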
+
+ Then in the calibration loop, inject the task-aware prompt before template application:
+ ```python
+ for i, msgs in enumerate(cal_samples):
+     # Determine task type from user content
+     user_text = " ".join(m["content"] for m in msgs if m["role"] == "user")
+     task = _classify_task_type(user_text)
+
+     # v3: Inject task-aware system prompt
+     msgs = inject_task_system_prompt(msgs, task)
+
+     text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
+     # ... rest of loop unchanged
+ ```
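
The loop depends on `_classify_task_type`, which this patch uses but does not define. For readers without the notebook at hand, the sketch below shows one way such a router could work. It is purely hypothetical: the function name `classify_task_type_sketch`, the keyword lists, and the `"unknown"` return value are all invented for illustration, and the notebook's real classifier may behave differently.

```python
def classify_task_type_sketch(user_text: str) -> str:
    """Hypothetical keyword router; NOT the notebook's _classify_task_type."""
    text = user_text.lower()
    # Ordered rules: the first task whose keywords appear in the text wins.
    rules = [
        ("extraction", ("json", "extraia")),
        ("push", ("push", "notificação")),
        ("sql_qa", ("sql", "consulta")),
        ("insights", ("insight", "estratégia")),
    ]
    for task, keywords in rules:
        if any(keyword in text for keyword in keywords):
            return task
    return "unknown"  # a lookup with a default would then use the legacy prompt


labels = [
    classify_task_type_sketch("Extraia os campos em JSON"),
    classify_task_type_sketch("Crie uma notificação push"),
    classify_task_type_sketch("Qual consulta SQL responde isso?"),
    classify_task_type_sketch("Bom dia"),
]
print(labels)
```

Whatever the real classifier does, its output strings need to match the keys used in `get_system_prompt` and `THINK_BUDGETS` exactly, otherwise records silently receive the fallback prompt and the default budget.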
+
+ ### Cell 8 (Dataset Preparation) - inject into train/eval records:
+
+ In `prepare_grpo_datasets_v3`, after building `train_records` and `eval_records` (before creating HF Datasets), add:
+ ```python
+ # v3: Inject task-aware system prompts into each training record
+ for i, record in enumerate(train_records):
+     user_text = " ".join(m["content"] for m in record if m["role"] == "user")
+     task = _classify_task_type(user_text)
+     train_records[i] = inject_task_system_prompt(record, task)
+
+ # Same for eval records
+ for i, record in enumerate(eval_records):
+     user_text = " ".join(m["content"] for m in record if m["role"] == "user")
+     task = _classify_task_type(user_text)
+     eval_records[i] = inject_task_system_prompt(record, task)
+
+ print("  ✓ Task-aware system prompts injected")
+ ```
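
The train and eval loops above are identical apart from the list they operate on, so they could be folded into one helper. The sketch below is one possible refactor, not part of the patch: `inject_all` is a hypothetical name, and the classifier and injector are passed in as arguments (here as simple stubs) so the example runs on its own.

```python
from typing import Callable, Dict, List

Record = List[Dict[str, str]]


def inject_all(
    records: List[Record],
    classify: Callable[[str], str],
    inject: Callable[[Record, str], Record],
) -> List[Record]:
    """Rewrite each record in place with its task-specific system prompt."""
    for i, record in enumerate(records):
        user_text = " ".join(m["content"] for m in record if m["role"] == "user")
        records[i] = inject(record, classify(user_text))
    return records


# Stubs standing in for _classify_task_type and inject_task_system_prompt (assumptions).
def stub_classify(text: str) -> str:
    return "push" if "push" in text.lower() else "other"


def stub_inject(record: Record, task: str) -> Record:
    system = {"role": "system", "content": f"prompt:{task}"}
    return [system] + [m for m in record if m["role"] != "system"]


train_records = [
    [{"role": "user", "content": "Crie um push de oferta"}],
    [{"role": "system", "content": "old"}, {"role": "user", "content": "Resuma"}],
]
inject_all(train_records, stub_classify, stub_inject)
print(train_records)
```

In the notebook this would reduce the Cell 8 addition to two calls, one for `train_records` and one for `eval_records`.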
+
+ ### Cell 11 (EvalRewardCallback) - no change needed:
+ System prompts were injected in Cell 8, so eval data already has the right prompts.
+
+ ### Cell 13 (Validation) - use task-aware selection:
+
+ Replace:
+ ```python
+ system_msg = {"role": "system", "content": SYSTEM_PT}
+ ```
+
+ With task-aware selection inside the loop:
+ ```python
+ # REMOVE the fixed system_msg line above the loop
+
+ # Inside the loop, before generating:
+ task = _classify_task_type(prompt["content"])
+ system_msg = {"role": "system", "content": get_system_prompt(task)}
+ messages = [system_msg, prompt]
+ ```
+
+ ---
+
+ ## Summary
+
+ | Cell | What changes | Lines affected |
+ |------|-------------|---------------|
+ | Cell 3 | Replace `SYSTEM_PT` with 4 task prompts + `get_system_prompt()` + `THINK_BUDGETS` | ~50 lines added |
+ | Cell 6 | Add `reward_think_efficiency()`, modify `commerce_reward_fn` dispatch | ~35 lines added, ~10 modified |
+ | Cell 7 | Add `inject_task_system_prompt()`, use in calibration loop | ~15 lines added |
+ | Cell 8 | Inject task-aware system prompts into train/eval records | ~10 lines added |
+ | Cell 13 | Use `get_system_prompt(task)` instead of fixed `SYSTEM_PT` | ~3 lines modified |
+
+ ## Expected impact
+
+ | Task | Current think tokens | Expected after patch | Mechanism |
+ |------|---------------------|---------------------|-----------|
+ | Extraction | 2000-3000 (100% ceiling) | ~300-800 (-60% to -70%) | "Não pense em excesso" + think penalty reward |
+ | Push | 1000-2000 | ~100-300 (-70% to -80%) | "Responda diretamente" + think penalty reward |
+ | SQL Q&A | 1500-2500 | ~400-800 (-50%) | "Seja conciso no raciocínio" + think budget reward |
+ | Insights | 2000-3200 (ceiling) | ~800-1500 (-30% to -40%) | "Use no máximo 500 tokens" + higher think budget |
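
The expected reductions in the table can be checked after training by measuring think-block length per task on a sample of completions. The sketch below shows one way to do that using the same 4-characters-per-token approximation as `reward_think_efficiency`. The helper names and the sample completions are assumptions for illustration; for reported numbers, counting with the model's actual tokenizer would be more accurate than the character heuristic.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def approx_think_tokens(completion: str, chars_per_token: float = 4.0) -> float:
    """Approximate think-block length in tokens; 0.0 when there is no think block."""
    match = THINK_RE.search(completion)
    if not match:
        return 0.0
    return len(match.group(1).strip()) / chars_per_token


def mean_think_tokens(samples: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Average approximate think tokens per task over (task, completion) pairs."""
    by_task = defaultdict(list)
    for task, completion in samples:
        by_task[task].append(approx_think_tokens(completion))
    return {task: mean(values) for task, values in by_task.items()}


# Synthetic completions (assumption): real ones would come from the validation cell.
samples = [
    ("push", "<think>" + "a" * 400 + "</think>Oferta!"),
    ("push", "Oferta relâmpago!"),
    ("sql_qa", "<think>" + "b" * 1600 + "</think>42 pedidos"),
]
report = mean_think_tokens(samples)
print(report)
```

Running the same measurement before and after the patch on identical prompts gives a like-for-like comparison against the "Current" and "Expected" columns.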