# Session Checkpoint – 2026-04-23 22:30 CEST
## Status: v3 Training Launched (Cell 11)
Probe passed (3 steps). Full training run initiated: 500 steps, ~25h estimated on L4.
---
## Context
### Where we are in the pipeline
```
Qwen3-4B-Base → Polygl0t/Tucano2-qwen-3.7B-Base (PT continual pretrain)
  → Polygl0t/Tucano2-qwen-3.7B-Think (SFT + thinking training)
  → YOUR SFT adapter (domain e-commerce, 1650 samples)
  → GRPO v2 (210 steps, early stopped) → +42% over SFT baseline
  → GRPO v3 (launching now) → all fixes from ADR-001 + thinking control patch
```
### Training data
- 1,404 prompts after the 15% eval holdout (from ~1,650 total; a split sketch follows this list)
- Distribution (pre-holdout, sums to 1,650): extraction=659, sql_qa=655, insights=114, push=222
- Using ALL data (v2 used a 300-prompt subset → memorization → entropy collapse)
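
A minimal sketch of how such a stratified 15% holdout can be produced with the `datasets` library. The `task` column name and the `prompts.jsonl` path are illustrative assumptions, not the notebook's actual Cell 8 code:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="prompts.jsonl", split="train")

# stratify_by_column needs a ClassLabel column, so encode the task labels first
ds = ds.class_encode_column("task")
split = ds.train_test_split(test_size=0.15, stratify_by_column="task", seed=42)

train_ds, eval_ds = split["train"], split["test"]
print(len(train_ds), len(eval_ds))  # ~1,400 train / ~250 eval for ~1,650 prompts
```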
### Hardware
- NVIDIA L4 (24GB VRAM), Vertex AI Workbench
- Unsloth 2026.4.8, TRL 0.24.0 (pinned), Transformers 5.5.0
- Peak VRAM in smoke test: 6.8GB / 23.6GB (massive headroom)
### v2 results (baseline to beat)
- 210/300 steps, early stopped at eval plateau
- Validation mean reward: 0.54 (+42% vs SFT calibration 0.38)
- Strong on insights/analysis (0.50-0.70), broken on extraction (0.12)
- Critical issues: entropy collapse (clip_ratio=0), completion ceiling (100% at 2048), KL=0.004
---
## Problem
### Problem 1: Think model's `<think>` blocks consume all tokens
The model generates 2000-3000 tokens of `<think>` content before producing answers. At both the 2048 (v2) and 4096 (v3) completion ceilings, extraction tasks never produce JSON: at low-temperature inference the model is still inside `<think>` when the budget runs out.
**Evidence from v3 calibration (Cell 7, temp=0.7):**
- 8/8 samples hit 4096 ceiling
- Both extraction samples: stuck in `<think>`, reward=0.11-0.12
- Task-aware system prompts ("Não pense em excesso", i.e. "don't overthink") had ZERO measurable effect
**However**, during GRPO training rollouts (temp=1.0):
- Smoke test: mean completion=528 tokens, 0% ceiling hits
- Probe step 2: mean completion=358 tokens, 0% ceiling hits
- Probe step 3: mean completion=1371 tokens, 25% ceiling hits (1 of 4)
- High temperature produces diverse SHORT completions → the model doesn't lock into verbose thinking (a measurement sketch follows this list)
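
A small measurement sketch for the numbers above. The `tokenizer` argument and the 4096 ceiling are the only assumptions; this is not the calibration cell's code:

```python
import re

MAX_COMPLETION = 4096  # v3 ceiling

def think_tokens(completion: str, tokenizer) -> tuple[int, bool]:
    """Return (tokens spent inside <think>, whether the block was ever closed)."""
    m = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if m is not None:
        return len(tokenizer.encode(m.group(1), add_special_tokens=False)), True
    if "<think>" in completion:
        # Never closed: everything after the tag is thinking (the "stuck" case)
        tail = completion.split("<think>", 1)[-1]
        return len(tokenizer.encode(tail, add_special_tokens=False)), False
    return 0, True

def hit_ceiling(completion: str, tokenizer) -> bool:
    return len(tokenizer.encode(completion, add_special_tokens=False)) >= MAX_COMPLETION
```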
### Problem 2: Entropy collapse (inherited from v2)
- v2: clip_ratio=0 on ALL steps, KL=0.004 → the policy never moved
- v3 probe: clip_ratio=0 on all 3 steps, but loss is nonzero (0.041 on step 3)
- May resolve after warmup; the entropy monitor callback will detect if it persists (a rough watchdog sketch follows)
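
A rough shape of such a watchdog, assuming it runs as a `transformers` `TrainerCallback` and that the clip ratio is logged under the `clip_ratio` key used in the probe table. This is illustrative, not the notebook's actual EntropyMonitorCallback:

```python
from transformers import TrainerCallback

class ClipRatioWatchdog(TrainerCallback):
    """Warn when clip_ratio stays at 0 past a warmup window (entropy-collapse signal)."""

    def __init__(self, warmup_steps: int = 50):
        self.warmup_steps = warmup_steps

    def on_log(self, args, state, control, logs=None, **kwargs):
        clip = (logs or {}).get("clip_ratio")  # assumed log key
        if clip == 0.0 and state.global_step > self.warmup_steps:
            print(f"[watchdog] clip_ratio still 0 at step {state.global_step}: "
                  "policy may not be moving (possible entropy collapse)")
        return control
```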
### Problem 3: Think model has no thinking toggle
- Checked Polygl0t/Tucano2-qwen-3.7B-Think model files
- `generation_config.json`: temperature=0.1, max_new_tokens=1024, no thinking control
- `chat_template.jinja`: always injects `<think>` on last assistant turn, no `enable_thinking` conditional
- Unlike official Qwen3-4B, which exposes an `enable_thinking=True/False` toggle (a quick template check is sketched after this list)
- Prompt-level control ("Não pense em excesso") proven ineffective at inference time
- L1 paper (2503.04697) confirms: untrained models ignore length instructions → an RL reward is needed to teach compliance
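
A quick way to confirm the missing toggle: render the chat template with `enable_thinking` on and off and compare, the way official Qwen3 templates respond. A hedged sketch, not the check actually run in the notebook:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Polygl0t/Tucano2-qwen-3.7B-Think")
msgs = [{"role": "user", "content": "Extract the order fields as JSON."}]

with_think = tok.apply_chat_template(
    msgs, tokenize=False, add_generation_prompt=True, enable_thinking=True)
without_think = tok.apply_chat_template(
    msgs, tokenize=False, add_generation_prompt=True, enable_thinking=False)

# Identical output means the template ignores the flag (the finding above)
print("template honors enable_thinking:", with_think != without_think)
```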
---
## Decisions Made
### Decision 1: Proceed with v3 training on Think model despite ceiling issues
- **Rationale**: Probe shows completions are SHORT during training (temp=1.0). The ceiling problem only manifests at low-temperature inference. GRPO rollouts at temp=1.0 produce 358-528 token completions on average. Training will work even if post-training inference needs tuning.
- **Risk**: If model learns at temp=1.0 but can't transfer to temp=0.1 inference, we get good training metrics but poor deployment performance.
### Decision 2: Task-aware system prompts (3-change patch)
Applied and verified:
- **Cell 3**: 4 task-specific system prompts (extraction, sql_qa, insights, push) + `THINK_BUDGETS` + `get_system_prompt()`
- **Cell 6**: `reward_think_efficiency()` → penalizes bloated `<think>` blocks against a per-task budget (extraction: 150 tok, push: 100, sql_qa: 400, insights: 800); a rough sketch follows this list
- **Cell 7**: `inject_task_system_prompt()` wired into calibration
- **Cell 8**: System prompt injection into training data via `prepare_grpo_datasets_v3()`
- **Cell 13**: Validation uses per-task system prompts
- **Research basis**: OptimalThinkingBench (2508.13141), Mid-Think (2601.07036), L1 (2503.04697)
- **Observed effect**: Zero at inference (calibration). Unknown during training; the reward signal may teach compliance over hundreds of steps.
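
A minimal sketch of what such a think-efficiency reward can look like under TRL's GRPO reward-function convention (string completions, dataset columns arriving as kwargs). Word counts stand in for token counts here; this is not the actual Cell 6 implementation:

```python
import re

THINK_BUDGETS = {"extraction": 150, "push": 100, "sql_qa": 400, "insights": 800}

def reward_think_efficiency(prompts, completions, task=None, **kwargs):
    """1.0 inside the per-task budget, decaying linearly to 0.0 as <think> bloats."""
    rewards = []
    for completion, t in zip(completions, task):
        if "<think>" not in completion:
            think_len = 0
        else:
            m = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
            body = m.group(1) if m else completion.split("<think>", 1)[-1]
            think_len = len(body.split())          # word count as a token proxy
        budget = THINK_BUDGETS.get(t, 400)
        overshoot = max(0.0, (think_len - budget) / budget)
        rewards.append(max(0.0, 1.0 - overshoot))  # full credit within budget
    return rewards
```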
### Decision 3: Plan base model training as next step
- Literature review conclusive: every canonical GRPO paper starts from base/instruct, not thinking models
- DeepSeek-R1-Zero proved thinking emerges from RL on base models
- ThinkJSON (2502.14905) beats R1-671B on JSON extraction using Qwen2.5-1.5B BASE + GRPO
- `Polygl0t/Tucano2-qwen-3.7B-Base` exists and has Portuguese continual pretraining
- Will need to re-run SFT (LoRA adapters are model-specific, can't transfer Think→Base)
### Decision 4: Did NOT add load_best_model_at_end
- Requires a native eval loop with `metric_for_best_model`, but our eval is a custom callback
- EvalRewardCallback tracks `best_reward` and `best_step` internally (rough shape sketched after this list)
- `SAVE_STEPS=10` + `SAVE_TOTAL_LIMIT=5` = 50 steps of checkpoint coverage → sufficient
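
The rough shape of that tracking, assuming the validation reward reaches the callback via trainer logs under an `eval/mean_reward` key. Illustrative skeleton only; the real EvalRewardCallback also runs the validation pass itself:

```python
from transformers import TrainerCallback

class BestRewardTracker(TrainerCallback):
    """Track the best validation reward and the step it occurred at."""

    def __init__(self):
        self.best_reward = float("-inf")
        self.best_step = None

    def on_log(self, args, state, control, logs=None, **kwargs):
        reward = (logs or {}).get("eval/mean_reward")  # assumed metric key
        if reward is not None and reward > self.best_reward:
            self.best_reward = reward
            self.best_step = state.global_step
        return control
```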
---
## v3 Config (all changes from v2)
| Parameter | v2 | v3 | Paper Reference |
|-----------|-----|-----|----------------|
| Temperature | 0.8 | **1.0** | Skywork-OR1 (2505.22312) |
| max_completion_length | 2048 | **4096** | Dr. GRPO (2503.20783) |
| num_generations | 8 | **4** | MC-GRPO (2601.22582), VRAM tradeoff |
| learning_rate | 5e-7 | **2e-6** | Dr. GRPO Appendix G |
| β (KL penalty) | implicit | **0.0** | Dr. GRPO §3.2 |
| Training data | 300 subset | **ALL ~1400** | Skywork-OR1 §3.1 |
| Rewards | single composite | **staged (format→partial→task)** | Reasoning-SQL (2503.23157) |
| System prompts | single generic | **4 task-aware** | OptimalThinkingBench (2508.13141) |
| Think efficiency reward | none | **reward_think_efficiency()** | L1 (2503.04697) |
| Zero-advantage groups | included | **noise injection (σ=0.005)** | Skywork-OR1 §3.1 |
| Entropy monitoring | none | **EntropyMonitorCallback** | Skywork-OR1 §4 |
| grad_accum | 2 | **1** | Effective batch 4 (was 8) |
| patience | 10 | **15** | More runway |
| delta | 0.01 | **0.005** | More sensitive |
| save_steps | 15 | **10** | Never lose best checkpoint |
| save_total_limit | 3 | **5** | More checkpoint coverage |
| eval_temperature | 0.7 | **0.1** | Deterministic eval |
| eval_max_tokens | 2048 | **4096** | Match training |
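
The same v3 values condensed into a TRL `GRPOConfig` sketch (field names as in TRL's GRPOConfig/TrainingArguments; treat it as illustrative rather than the notebook's exact training config, since TRL 0.24.0 is pinned via Unsloth):

```python
from trl import GRPOConfig

grpo_config = GRPOConfig(
    output_dir="grpo-v3-l4",
    temperature=1.0,                 # v2: 0.8
    max_completion_length=4096,      # v2: 2048
    num_generations=4,               # v2: 8 (VRAM tradeoff)
    learning_rate=2e-6,              # v2: 5e-7
    beta=0.0,                        # KL penalty off (Dr. GRPO)
    per_device_train_batch_size=4,   # on one GPU, must be divisible by num_generations
    gradient_accumulation_steps=1,   # effective batch 4 (was 8)
    max_steps=500,
    save_steps=10,
    save_total_limit=5,
    report_to="wandb",
)
```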
---
## Probe Results (3 steps)
| Step | Completion (mean) | Clipped | Reward | reward_std | Loss | clip_ratio |
|------|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 528 | 0% | 0.419 | 0.049 | -0.0002 | 0 |
| 2 | 358 | 0% | 0.718 | 0.043 | 0.0001 | 0 |
| 3 | 1371 (one@4096) | 25% | 0.603 | 0.074 | **0.041** | 0 |
- `frac_reward_zero_std = 0` on all steps → v2's critical bug is fixed
- Step time: 65-420s depending on completion length; average 180s/step
- Estimated full run: 500 steps × 180s ≈ 25 hours
---
## Consequences
### What we expect from v3
- SQL/insights/push should improve → the model produces answers and rewards have variance
- Extraction may or may not improve → depends on whether temp=1.0 rollouts produce enough JSON for the reward to shape behavior
- clip_ratio=0 may persist → if so, entropy collapse is still the failure mode, even with all fixes
- Training will be slow (~25h) due to occasional 4096-token completions
### What comes after v3
1. **Evaluate v3** → run benchmark, compare to v2 validation (mean=0.54)
2. **Document lessons** → update PROJECT.md with v3 findings
3. **Base model training** → `Polygl0t/Tucano2-qwen-3.7B-Base` → SFT → GRPO with shorter completions (512-1024)
4. **Hybrid deployment** β base model for extraction/SQL/push, think model for insights (if v3 insights are strong)
---
## Lessons Learned (this session)
### Technical
1. **Thinking models are incompatible with small completion budgets.** The `<think>` block is not controllable via system prompts on untrained models. L1 paper confirmed: length compliance requires RL training. On a 24GB GPU with 4096 max tokens, the think overhead leaves insufficient room for structured output.
2. **Temperature changes everything for GRPO.** At temp=0.7 (calibration), the model locks into verbose deterministic thinking → 100% ceiling hits. At temp=1.0 (training), the model explores diverse short completions → 358-528 tokens on average. This is the single biggest factor determining whether GRPO works on this model.
3. **Calibration at inference temperature ≠ training behavior.** The calibration cell uses temp=0.7 to simulate eval. But GRPO trains at temp=1.0. The calibration results (0.43 mean, 100% ceiling) are misleading: actual training dynamics are much healthier (0.60 mean, 25% ceiling). Future calibration should include a temp=1.0 pass.
4. **Every canonical GRPO paper starts from base/instruct, not thinking models.** DeepSeek-R1-Zero, Dr. GRPO, DAPO, ThinkJSON, Reasoning-SQL, RL-Struct β all start from base. Only Skywork-OR1 starts from a thinking model, and that's for squeezing marginal SOTA gains, not domain adaptation.
5. **LoRA adapters are model-specific.** Can't transfer SFT adapter from Think model to Base model β weights are calibrated to different base weight spaces. Switching to base requires re-running SFT.
6. **Transformers version drift causes warnings.** Unsloth 2026.4.8 pulls Transformers 5.5.0 (v2 had 4.57.6). TRL 0.24.0 was written for the older version, hence deprecation warnings about `generation_config` kwargs and `AttentionMaskConverter`. Harmless but noisy.
### Process
7. **Prompt engineering research before implementation saves compute.** The literature crawl found 6 papers on thinking control (OptimalThinkingBench, Mid-Think, L1, AdaptThink, TALE, ThoughtTerminator) in one research call. The finding that "Don't overthink" reduces tokens by 23% on Qwen3 was directly applicable: even though the effect was zero on this specific model at inference time, the `reward_think_efficiency()` function may still teach compliance during training.
8. **The model family tree matters.** Discovering that `Polygl0t/Tucano2-qwen-3.7B-Think` was trained on top of `Polygl0t/Tucano2-qwen-3.7B-Base`, which in turn continues `Qwen/Qwen3-4B-Base`, gave us a clean non-thinking alternative with Portuguese pretraining preserved. Without checking the Hub metadata, we might have defaulted to vanilla `Qwen3-4B-Base` and lost the Portuguese specialization.
9. **Log everything to W&B.** Moving W&B init to Cell 3 means all preflight checks (inference test, KV cache, calibration) are logged. When the notebook disconnects mid-calibration, the data survives. This was the user's idea and is essential for long-running Vertex AI sessions (the pattern is sketched below).
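
A hedged sketch of the pattern (project/run names and logged keys are illustrative, not the exact Cell 3 code; the calibration numbers are the ones reported above):

```python
import wandb

# Init first, before any preflight work, so partial results survive a disconnect
run = wandb.init(
    project="tucano2-commerce",
    name="grpo-v3-l4-example",
    config={"temperature": 1.0, "max_completion_length": 4096, "num_generations": 4},
)

# Preflight checks logged immediately
wandb.log({"preflight/inference_ok": 1, "preflight/peak_vram_gb": 6.8})
wandb.log({"calibration/mean_reward": 0.43, "calibration/ceiling_hit_rate": 1.0})
```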
---
## Files in repo
```
rtferraz/tucano2-commerce/
├── docs/
│   ├── PROJECT.md                      # Full project documentation
│   ├── ADR-001-next-steps.md           # Execution plans (benchmark, comparison, v3)
│   ├── v3_thinking_control_patch.md    # The 3-change patch spec
│   └── checkpoints/
│       └── 2026-04-23_v3-launch.md     # ← THIS FILE
├── notebooks/
│   └── grpo_vertex_v3.ipynb            # v3 notebook (patched, running)
├── scripts/
│   └── md_to_ipynb.py                  # Markdown → notebook converter
├── grpo_vertex_v2_ipynb.md             # v2 reference with outputs
└── .gitignore
```
---
## To resume this session
1. Check W&B: `tferrazrafael-self/tucano2-commerce` → look for run `grpo-v3-l4-*` (API sketch after this list)
2. Check training progress: reward trend, clip_ratio, completion_length
3. If clip_ratio is still 0 after step 50 → entropy collapse; consider stopping early
4. If completion_length trends toward 4096 → the model learned to fill the budget; think control is failing
5. If reward improves and completion_length stays <2000 → v3 is working; let it run
6. After training: run Cell 12 (save), Cell 13 (validation), compare to v2 (mean=0.54)
7. Then: plan base model SFT + GRPO for extraction-focused training
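
For steps 1-2, a hedged W&B API sketch; the metric keys are assumptions and must be adjusted to whatever the trainer actually logs:

```python
import wandb

api = wandb.Api()
v3_runs = [r for r in api.runs("tferrazrafael-self/tucano2-commerce")
           if r.name.startswith("grpo-v3-l4")]
run = sorted(v3_runs, key=lambda r: r.created_at)[-1]   # most recent v3 run
print("state:", run.state)

# Pull the trends from the checklist: reward, clip_ratio, completion length
history = run.history(keys=["reward", "clip_ratio", "completion_length"])
print(history.tail(20))
```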