Session Checkpoint – 2026-04-23 22:30 CEST
Status: v3 Training Launched (Cell 11)
Probe passed (3 steps). Full training run initiated: 500 steps, ~25h estimated on L4.
Context
Where we are in the pipeline
Qwen3-4B-Base → Polygl0t/Tucano2-qwen-3.7B-Base (PT continual pretrain)
→ Polygl0t/Tucano2-qwen-3.7B-Think (SFT + thinking training)
→ YOUR SFT adapter (domain e-commerce, 1650 samples)
→ GRPO v2 (210 steps, early stopped) → +42% over SFT baseline
→ GRPO v3 (launching now) → all fixes from ADR-001 + thinking control patch
Training data
- 1,404 prompts after 15% eval holdout (from ~1,650 total)
- Distribution: extraction=659, sql_qa=655, insights=114, push=222
- Using ALL data (v2 used a 300-sample subset → memorization → entropy collapse); the holdout split is sketched below
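A minimal sketch of how a 15% stratified holdout could be built so the eval set mirrors the task distribution above. The `samples` variable and the `task` field are assumptions about the dataset schema, not the notebook's actual code.

```python
# Hypothetical sketch of the 85/15 split with per-task stratification.
# `samples` is assumed to be a list of dicts with a "task" field.
from sklearn.model_selection import train_test_split

def split_prompts(samples, eval_frac=0.15, seed=42):
    """Hold out eval_frac of the prompts while preserving the task mix."""
    tasks = [s["task"] for s in samples]  # extraction / sql_qa / insights / push
    train_set, eval_set = train_test_split(
        samples, test_size=eval_frac, stratify=tasks, random_state=seed
    )
    return train_set, eval_set  # roughly 1,404 / 246 for ~1,650 prompts
```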
Hardware
- NVIDIA L4 (24GB VRAM), Vertex AI Workbench
- Unsloth 2026.4.8, TRL 0.24.0 (pinned), Transformers 5.5.0
- Peak VRAM in smoke test: 6.8GB / 23.6GB (massive headroom)
v2 results (baseline to beat)
- 210/300 steps, early stopped at eval plateau
- Validation mean reward: 0.54 (+42% vs SFT calibration 0.38)
- Strong on insights/analysis (0.50-0.70), broken on extraction (0.12)
- Critical issues: entropy collapse (clip_ratio=0), completion ceiling (100% at 2048), KL=0.004
Problem
Problem 1: Think model's <think> blocks consume all tokens
The model generates 2000-3000 tokens of <think> content before producing answers. At both 2048 (v2) and 4096 (v3) completion ceilings, extraction tasks never produce JSON – the model is stuck in <think> at inference time with low temperature.
Evidence from v3 calibration (Cell 7, temp=0.7):
- 8/8 samples hit 4096 ceiling
- Both extraction samples: stuck in <think>, reward=0.11-0.12
- Task-aware system prompts ("Não pense em excesso" / "Don't overthink") had ZERO measurable effect
However, during GRPO training rollouts (temp=1.0):
- Smoke test: mean completion=528 tokens, 0% ceiling hits
- Probe step 2: mean completion=358 tokens, 0% ceiling hits
- Probe step 3: mean completion=1371 tokens, 25% ceiling hits (1 of 4)
- High temperature produces diverse SHORT completions – the model doesn't lock into verbose thinking
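A minimal sketch of how this temperature effect can be measured outside the trainer, assuming `model`, `tokenizer`, and already chat-formatted `prompts` exist from earlier cells; this is not the notebook's calibration code.

```python
import torch

def completion_stats(model, tokenizer, prompts, temperature, max_new_tokens=4096):
    """Return (mean completion length, fraction of ceiling hits) at a given temperature."""
    lengths, ceiling_hits = [], 0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(
                **inputs,
                do_sample=True,
                temperature=temperature,
                max_new_tokens=max_new_tokens,
            )
        n_new = out.shape[-1] - inputs["input_ids"].shape[-1]
        lengths.append(n_new)
        ceiling_hits += int(n_new >= max_new_tokens)
    return sum(lengths) / len(lengths), ceiling_hits / len(prompts)

# compare e.g. completion_stats(..., temperature=0.7) vs temperature=1.0
```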
Problem 2: Entropy collapse (inherited from v2)
- v2: clip_ratio=0 on ALL steps, KL=0.004 → the policy never moved
- v3 probe: clip_ratio=0 on all 3 steps, but loss is nonzero (0.041 on step 3)
- May resolve after warmup; entropy monitor callback will detect if it persists
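A minimal sketch of the kind of watchdog the checkpoint calls the entropy monitor callback. The metric key `clip_ratio` and the warmup threshold are assumptions; the names TRL logs vary by version, so the real EntropyMonitorCallback in the notebook may differ.

```python
from transformers import TrainerCallback

class EntropyMonitorSketch(TrainerCallback):
    """Warn if clip_ratio stays at ~0 after a warmup window (possible entropy collapse)."""

    def __init__(self, warmup_steps=50, min_clip_ratio=1e-6):
        self.warmup_steps = warmup_steps
        self.min_clip_ratio = min_clip_ratio

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs or state.global_step < self.warmup_steps:
            return
        clip_ratio = logs.get("clip_ratio")  # assumed metric key
        if clip_ratio is not None and clip_ratio <= self.min_clip_ratio:
            print(f"[entropy-monitor] step {state.global_step}: clip_ratio={clip_ratio} "
                  "still ~0 after warmup – possible entropy collapse")
```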
Problem 3: Think model has no thinking toggle
- Checked Polygl0t/Tucano2-qwen-3.7B-Think model files:
  - `generation_config.json`: temperature=0.1, max_new_tokens=1024, no thinking control
  - `chat_template.jinja`: always injects `<think>` on the last assistant turn, no `enable_thinking` conditional
- Unlike official Qwen3-4B, which has an `enable_thinking=True/False` toggle (see the sketch after this list)
- Prompt-level control ("Não pense em excesso") proven ineffective at inference time
- L1 paper (2503.04697) confirms: untrained models ignore length instructions → an RL reward is needed to learn compliance
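For contrast, a sketch of the toggle the official Qwen3 template exposes. Whether a checkpoint honours `enable_thinking` depends entirely on its `chat_template.jinja`; per the inspection above, the Tucano2 Think template has no such conditional.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
messages = [{"role": "user", "content": "Extract the order fields as JSON."}]

# Official Qwen3 template: thinking can be switched off per call.
text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# With Polygl0t/Tucano2-qwen-3.7B-Think the same kwarg would be silently
# ignored, because its template unconditionally opens <think> on the last
# assistant turn.
```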
Decisions Made
Decision 1: Proceed with v3 training on Think model despite ceiling issues
- Rationale: Probe shows completions are SHORT during training (temp=1.0). The ceiling problem only manifests at low-temperature inference. GRPO rollouts at temp=1.0 produce 358-528 token completions on average. Training will work even if post-training inference needs tuning.
- Risk: If model learns at temp=1.0 but can't transfer to temp=0.1 inference, we get good training metrics but poor deployment performance.
Decision 2: Task-aware system prompts (3-change patch)
Applied and verified:
- Cell 3: 4 task-specific system prompts (extraction, sql_qa, insights, push) + `THINK_BUDGETS` + `get_system_prompt()`
- Cell 6: `reward_think_efficiency()` – penalizes bloated `<think>` blocks against per-task budgets (extraction: 150 tok, push: 100, sql_qa: 400, insights: 800); a sketch follows below
- Cell 7: `inject_task_system_prompt()` wired into calibration
- Cell 8: system prompt injection into training data via `prepare_grpo_datasets_v3()`
- Cell 13: validation uses per-task system prompts
- Research basis: OptimalThinkingBench (2508.13141), Mid-Think (2601.07036), L1 (2503.04697)
- Observed effect: zero at inference (calibration). Unknown during training – the reward signal may teach compliance over hundreds of steps.
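A minimal sketch in the spirit of Cell 6's reward_think_efficiency(): the per-task budgets are the ones listed above, but the matching logic and the decay curve are illustrative assumptions, not the notebook's implementation.

```python
import re

THINK_BUDGETS = {"extraction": 150, "push": 100, "sql_qa": 400, "insights": 800}

def reward_think_efficiency_sketch(completion: str, task: str, tokenizer) -> float:
    """1.0 when the <think> block fits the task budget, decaying toward 0 as it bloats."""
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return 1.0  # no closed think block: nothing to penalize here
    think_tokens = len(tokenizer.encode(match.group(1)))
    budget = THINK_BUDGETS.get(task, 400)
    if think_tokens <= budget:
        return 1.0
    # linear decay: at double the budget the reward reaches 0
    return max(0.0, 1.0 - (think_tokens - budget) / budget)
```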
Decision 3: Plan base model training as next step
- Literature review conclusive: every canonical GRPO paper starts from base/instruct, not thinking models
- DeepSeek-R1-Zero proved thinking emerges from RL on base models
- ThinkJSON (2502.14905) beats R1-671B on JSON extraction using Qwen2.5-1.5B BASE + GRPO
- `Polygl0t/Tucano2-qwen-3.7B-Base` exists and has Portuguese continual pretraining
- Will need to re-run SFT (LoRA adapters are model-specific, can't transfer Think→Base)
Decision 4: Did NOT add load_best_model_at_end
- Requires the native eval loop with `metric_for_best_model` – our eval is a custom callback
- EvalRewardCallback tracks `best_reward` and `best_step` internally (bookkeeping sketched below)
- `SAVE_STEPS=10` + `SAVE_TOTAL_LIMIT=5` = 50 steps of checkpoint coverage – sufficient
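A minimal sketch of the bookkeeping side of this decision – tracking best_reward/best_step in the callback instead of relying on load_best_model_at_end. The real EvalRewardCallback also runs the eval itself; only the tracking is shown, and the class name is hypothetical.

```python
class BestRewardTrackerSketch:
    """Remember the best eval reward and the step (= checkpoint) that produced it."""

    def __init__(self):
        self.best_reward = float("-inf")
        self.best_step = None

    def update(self, step: int, mean_reward: float) -> None:
        if mean_reward > self.best_reward:
            self.best_reward = mean_reward
            self.best_step = step
            print(f"[eval] new best reward {mean_reward:.3f} at step {step} "
                  f"(keep checkpoint-{step})")
```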
v3 Config (all changes from v2)
| Parameter | v2 | v3 | Paper Reference |
|---|---|---|---|
| Temperature | 0.8 | 1.0 | Skywork-OR1 (2505.22312) |
| max_completion_length | 2048 | 4096 | Dr. GRPO (2503.20783) |
| num_generations | 8 | 4 | MC-GRPO (2601.22582) – VRAM tradeoff |
| learning_rate | 5e-7 | 2e-6 | Dr. GRPO Appendix G |
| β (KL penalty) | implicit | 0.0 | Dr. GRPO §3.2 |
| Training data | 300 subset | ALL ~1400 | Skywork-OR1 §3.1 |
| Rewards | single composite | staged (format→partial→task) | Reasoning-SQL (2503.23157) |
| System prompts | single generic | 4 task-aware | OptimalThinkingBench (2508.13141) |
| Think efficiency reward | none | reward_think_efficiency() | L1 (2503.04697) |
| Zero-advantage groups | included | noise injection (σ=0.005) | Skywork-OR1 §3.1 |
| Entropy monitoring | none | EntropyMonitorCallback | Skywork-OR1 §4 |
| grad_accum | 2 | 1 | Effective batch 4 (was 8) |
| patience | 10 | 15 | More runway |
| delta | 0.01 | 0.005 | More sensitive |
| save_steps | 15 | 10 | Never lose best checkpoint |
| save_total_limit | 3 | 5 | More checkpoint coverage |
| eval_temperature | 0.7 | 0.1 | Deterministic eval |
| eval_max_tokens | 2048 | 4096 | Match training |
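A hedged sketch of how the v3 column could map onto TRL's GRPOConfig. Field names follow TRL 0.x; the batch size and output_dir are assumptions, and the actual notebook cell may differ.

```python
from trl import GRPOConfig

grpo_config_v3 = GRPOConfig(
    output_dir="outputs/grpo-v3",     # assumed path
    max_steps=500,
    learning_rate=2e-6,               # Dr. GRPO Appendix G
    temperature=1.0,                  # high-temperature rollouts (Skywork-OR1)
    max_completion_length=4096,
    num_generations=4,                # VRAM tradeoff on the L4
    beta=0.0,                         # no KL penalty (Dr. GRPO §3.2)
    per_device_train_batch_size=4,    # assumed; effective batch 4 with no accumulation
    gradient_accumulation_steps=1,
    save_steps=10,
    save_total_limit=5,
    logging_steps=1,
)
```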
Probe Results (3 steps)
| Step | Completion (mean) | Clipped | Reward | reward_std | Loss | clip_ratio |
|---|---|---|---|---|---|---|
| 1 | 528 | 0% | 0.419 | 0.049 | -0.0002 | 0 |
| 2 | 358 | 0% | 0.718 | 0.043 | 0.0001 | 0 |
| 3 | 1371 (one@4096) | 25% | 0.603 | 0.074 | 0.041 | 0 |
- `frac_reward_zero_std = 0` on all steps → v2's critical bug is fixed
- Step time: 65-420s depending on completion length; average 180s/step
- Estimated full run: 500 steps × 180s = ~25 hours
Consequences
What we expect from v3
- SQL/insights/push should improve – the model produces answers and the rewards have variance
- Extraction may or may not improve – depends on whether temp=1.0 rollouts produce enough JSON for the reward to shape behavior
- clip_ratio=0 may persist – if so, entropy collapse is still the failure mode, even with all fixes
- Training will be slow (~25h) due to occasional 4096-token completions
What comes after v3
- Evaluate v3 – run the benchmark, compare to v2 validation (mean=0.54)
- Document lessons – update PROJECT.md with v3 findings
- Base model training – `Polygl0t/Tucano2-qwen-3.7B-Base` → SFT → GRPO with shorter completions (512-1024)
- Hybrid deployment – base model for extraction/SQL/push, think model for insights (if v3 insights are strong); see the routing sketch after this list
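A tiny sketch of what the hybrid routing could look like; the routing table itself is exactly what the v3 evaluation should decide, so every entry here is provisional.

```python
# Provisional task -> model routing for the hybrid deployment idea.
ROUTES = {
    "extraction": "base",   # structured output, no <think> overhead
    "sql_qa": "base",
    "push": "base",
    "insights": "think",    # only if v3 insights stay strong
}

def pick_model(task: str) -> str:
    return ROUTES.get(task, "base")
```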
Lessons Learned (this session)
Technical
- Thinking models are incompatible with small completion budgets. The `<think>` block is not controllable via system prompts on untrained models. The L1 paper confirmed: length compliance requires RL training. On a 24GB GPU with 4096 max tokens, the think overhead leaves insufficient room for structured output.
- Temperature changes everything for GRPO. At temp=0.7 (calibration), the model locks into verbose deterministic thinking – 100% ceiling hits. At temp=1.0 (training), the model explores diverse short completions – 358-528 tokens on average. This is the single biggest factor determining whether GRPO works on this model.
- Calibration at inference temperature ≠ training behavior. The calibration cell uses temp=0.7 to simulate eval, but GRPO trains at temp=1.0. The calibration results (0.43 mean, 100% ceiling) are misleading – actual training dynamics are much healthier (0.60 mean, 25% ceiling). Future calibration should include a temp=1.0 pass.
- Every canonical GRPO paper starts from base/instruct, not thinking models. DeepSeek-R1-Zero, Dr. GRPO, DAPO, ThinkJSON, Reasoning-SQL, RL-Struct – all start from base. Only Skywork-OR1 starts from a thinking model, and that is for squeezing marginal SOTA gains, not domain adaptation.
- LoRA adapters are model-specific. The SFT adapter can't be transferred from the Think model to the Base model – the weights are calibrated to different base weight spaces. Switching to base requires re-running SFT (a quick provenance check is sketched below).
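A quick sketch of checking which base model an adapter was trained against before trying to attach it elsewhere; the adapter path is hypothetical.

```python
from peft import PeftConfig

cfg = PeftConfig.from_pretrained("outputs/sft-adapter")  # hypothetical local path
print(cfg.base_model_name_or_path)
# If this prints Polygl0t/Tucano2-qwen-3.7B-Think, the adapter cannot simply be
# attached to ...-Base – switching base models means re-running SFT.
```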
- Transformers version drift causes warnings. Unsloth 2026.4.8 pulls Transformers 5.5.0 (v2 had 4.57.6). TRL 0.24.0 was written for the older version, hence deprecation warnings about `generation_config` kwargs and `AttentionMaskConverter`. Harmless but noisy.
Process
- Prompt engineering research before implementation saves compute. The literature crawl found 6 papers on thinking control (OptimalThinkingBench, Mid-Think, L1, AdaptThink, TALE, ThoughtTerminator) in one research call. The finding that "Don't overthink" reduces tokens by 23% on Qwen3 was directly applicable – even though the effect was zero on this specific model at inference time, the `reward_think_efficiency()` function may still teach compliance during training.
- The model family tree matters. Discovering that `Polygl0t/Tucano2-qwen-3.7B-Think` ← `Polygl0t/Tucano2-qwen-3.7B-Base` ← `Qwen/Qwen3-4B-Base` gave us a clean non-thinking alternative with Portuguese pretraining preserved. Without checking the Hub metadata, we might have defaulted to vanilla `Qwen3-4B-Base` and lost the Portuguese specialization.
- Log everything to W&B. Moving the W&B init to Cell 3 means all preflight checks (inference test, KV cache, calibration) are logged. When the notebook disconnects mid-calibration, the data survives. This was the user's idea – essential for long-running Vertex AI sessions.
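A minimal sketch of the early-init pattern, assuming the project lives under the W&B path referenced later in this checkpoint; the run name, config keys, and resume policy are illustrative.

```python
import wandb

run = wandb.init(
    project="tucano2-commerce",
    name="grpo-v3-l4-preflight",   # assumed name; actual runs follow grpo-v3-l4-*
    resume="allow",                # reattach if the kernel reconnects mid-session
    config={"phase": "v3", "hardware": "L4-24GB"},
)
wandb.log({"preflight/smoke_test_vram_gb": 6.8})
```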
Files in repo
rtferraz/tucano2-commerce/
├── docs/
│   ├── PROJECT.md                     # Full project documentation
│   ├── ADR-001-next-steps.md          # Execution plans (benchmark, comparison, v3)
│   ├── v3_thinking_control_patch.md   # The 3-change patch spec
│   └── checkpoints/
│       └── 2026-04-23_v3-launch.md    # ← THIS FILE
├── notebooks/
│   └── grpo_vertex_v3.ipynb           # v3 notebook (patched, running)
├── scripts/
│   └── md_to_ipynb.py                 # Markdown → notebook converter
├── grpo_vertex_v2_ipynb.md            # v2 reference with outputs
└── .gitignore
To resume this session
- Check W&B: `tferrazrafael-self/tucano2-commerce` – look for run `grpo-v3-l4-*`
- Check training progress: reward trend, clip_ratio, completion_length
- If clip_ratio is still 0 after step 50 → entropy collapse, consider stopping early
- If completion_length trends toward 4096 → the model learned to fill the budget, think control failing
- If reward improves and completion_length stays <2000 → v3 is working, let it run
- After training: run Cell 12 (save), Cell 13 (validation), compare to v2 (mean=0.54)
- Then: plan base model SFT + GRPO for extraction-focused training
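To resume from another machine, a sketch of pulling the run's history through the public W&B API; the metric keys mirror the checklist above but are assumptions about what the trainer actually logs.

```python
import wandb

api = wandb.Api()
runs = api.runs(
    "tferrazrafael-self/tucano2-commerce",
    filters={"display_name": {"$regex": "^grpo-v3-l4-"}},
)
history = runs[0].history(
    keys=["train/reward", "train/clip_ratio", "train/completion_length"]  # assumed keys
)
print(history.tail())
```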