---
title: GRPO Training Session Log
description: Chronological log of GRPO training runs on Qwen3-0.6B/1.7B covering nine runs, fixes applied, multi-turn SFT breakthrough, and capacity ceiling analysis
doc_type: exploration
---
# GRPO Training Session Log
## Context
Training Qwen3-1.7B (with later runs and the v1/v2 comparison checkpoints on Qwen3-0.6B) as a SQL agent using SFT warmup + GRPO with TRL's `environment_factory` on the Spider dataset. Running on a Colab L4 (24GB).
Started 2026-04-02. Multi-turn SFT breakthrough on 2026-04-03.
## Key Findings & Fixes Applied
### 1. SFT Null-Param Injection (ROOT CAUSE of first collapse)
**Problem**: Qwen3's `apply_chat_template` expands dict arguments to include ALL parameter names from ALL tools with null values. SFT trained model to always generate `{"sql": null, "table_name": "X", "value": null}`.
**Fix**: Pass arguments as JSON strings (`json.dumps({"table_name": table})`) instead of dicts. Tokenizer uses strings verbatim.
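A minimal sketch of the fix (the `describe` tool name is from this project; the message shape is illustrative):

```python
import json

table = "singer"  # hypothetical table name

# Passing arguments as a dict lets Qwen3's chat template expand them,
# injecting every tool's parameter names with null values.
dict_arguments = {"table_name": table}

# Fix: serialize to a JSON string; the template emits it verbatim.
string_arguments = json.dumps({"table_name": table})

tool_call = {"name": "describe", "arguments": string_arguments}
print(tool_call["arguments"])  # {"table_name": "singer"}
```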
### 2. SFT Answer Formatting
**Problem**: Gold answers were Python literals (`['a', 'b']`, `[[1, 'amc']]`). Model learned wrong format.
**Fix**: `_format_answer_for_model()` converts to human-readable: comma-separated lists, pipe-separated table rows.
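A sketch of what such a converter might look like (the real `_format_answer_for_model()` may differ in details):

```python
def format_answer_for_model(answer):
    """Convert Python-literal gold answers to the human-readable target format."""
    if isinstance(answer, list):
        # List of rows -> one pipe-separated line per row.
        if answer and all(isinstance(row, (list, tuple)) for row in answer):
            return "\n".join(" | ".join(str(col) for col in row) for row in answer)
        # Flat list -> comma-separated values.
        return ", ".join(str(item) for item in answer)
    return str(answer)

print(format_answer_for_model(["a", "b"]))    # a, b
print(format_answer_for_model([[1, "amc"]]))  # 1 | amc
```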
### 3. Empty Tool Responses
**Problem**: TRL adapter returned `observation.result` (empty on SQL errors), hiding errors from model.
**Fix**: `_result_or_error()` falls back to `observation.error` so model sees "Error: SQL error: ...".
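A sketch of the fallback (attribute names follow the `observation.result` / `observation.error` fields named above):

```python
def result_or_error(observation):
    # Prefer the tool result; on failure surface the error text so the
    # model sees "Error: SQL error: ..." instead of an empty string.
    if getattr(observation, "result", None):
        return observation.result
    error = getattr(observation, "error", None)
    return f"Error: {error}" if error else ""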
### 4. Post-Episode Penalty
**Problem**: Model continues calling tools after answering, wasting steps with no signal.
**Fix**: `_POST_EPISODE_PENALTY = -0.1` applied in all 4 tool methods when `self._done` is True.
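The guard pattern, sketched for one tool method (simplified; the real adapter applies it in all four):

```python
_POST_EPISODE_PENALTY = -0.1

class ToolAdapter:
    def __init__(self):
        self._done = False
        self.reward = 0.0

    def query(self, sql):
        # After the episode ends, every further tool call costs reward
        # instead of silently returning nothing.
        if self._done:
            self.reward += _POST_EPISODE_PENALTY
            return {"error": "Episode is over"}
        return {"result": f"(would execute: {sql})"}
```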
### 5. Answer Stripping
**Problem**: Model wraps answers in quotes, code fences, "Answer:" prefix.
**Fix**: `_strip_answer_wrapping()` in verifier preprocesses predicted answers.
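A sketch of the preprocessing (the real verifier may handle more cases):

```python
import re

def strip_answer_wrapping(text):
    """Sketch: unwrap code fences, an 'Answer:' prefix, and surrounding quotes."""
    t = text.strip()
    t = re.sub(r"^```[a-z]*\n?", "", t)    # leading code fence
    t = re.sub(r"\n?```$", "", t).strip()  # trailing code fence
    t = re.sub(r"^answer:\s*", "", t, flags=re.IGNORECASE)
    if len(t) >= 2 and t[0] == t[-1] and t[0] in "\"'":
        t = t[1:-1]                        # matching surrounding quotes
    return t.strip()

print(strip_answer_wrapping('Answer: "42"'))  # 42
```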
### 6. Per-Turn SFT → Multi-Turn SFT (ROOT CAUSE of Run 5 stall)
**Problem**: SFT generated one example per assistant turn (347 examples, ~50% describe calls). Model over-learned "call describe" and never practiced query→answer. During GRPO with KL penalty, model stayed anchored to this single-turn policy.
**Fix**: Generate one full multi-turn example per question (100 examples, each containing describe→query→answer). Enable `assistant_only_loss` via Qwen3 template patch so loss is on assistant turns only.
**Key detail**: Qwen3's chat template lacks `{% generation %}` tags required by TRL for `assistant_only_loss`. Patch the template before SFT, restore original before GRPO (TRL does exact-match template checks in `add_response_schema()` and `get_training_chat_template()`).
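A toy illustration of the patch/restore mechanics (the stand-in template string below is hypothetical; the real anchor inside Qwen3's template differs):

```python
# Toy stand-in for the assistant branch of Qwen3's chat template.
ORIGINAL = "{% if message.role == 'assistant' %}{{ message.content }}{% endif %}"

# Patch: wrap assistant content in {% generation %} tags so TRL's
# assistant_only_loss can restrict the loss to assistant tokens.
patched = ORIGINAL.replace(
    "{{ message.content }}",
    "{% generation %}{{ message.content }}{% endgeneration %}",
)

# After SFT, restore the byte-identical original: TRL's add_response_schema()
# and get_training_chat_template() do exact string comparison.
restored = ORIGINAL
```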
### 7. Removed Arrow-Notation Few-Shot Examples
**Problem**: System prompt contained few-shot examples using arrow notation (`→ describe(table_name="X")`) while the model must produce `{"name":"describe","arguments":...}` JSON. Two competing formats for a 1.7B model.
**Fix**: Removed `_FEW_SHOT_BLOCK` from system prompt. The textual "Strategy" section is sufficient.
### 8. KL Penalty + Curriculum
**Problem**: GRPO drifted the policy away from SFT, corrupting the structural tool-call tokens (the Run 3 format drift).
**Fix**: `beta=0.04` KL penalty + easy-first curriculum (phase 1: easy only, phase 2: easy+medium). With multi-turn SFT, beta=0.04 no longer blocks exploration.
### 9. OOM with Reference Model
**Problem**: `beta>0` loads reference model copy, doubling memory on L4.
**Fix**: Reduced `num_generations` 6→4 and `max_new_tokens` 1024→512 for phase 1. Phase 2 drops to beta=0 and restores 1024 tokens.
### 10. generation_batch_size Divisibility
**Problem**: `generation_batch_size` (default 8) not divisible by `num_generations` (6).
**Fix**: Set `generation_batch_size=config.num_generations` in notebook_pipeline.
## Discovered Issues
### CTE (WITH clause) rejected by environment
**Problem**: `sql_environment.py` SQL validation only allows queries starting with `SELECT`. The model discovers CTEs during GRPO (`WITH dogs AS (...) SELECT ...`), gets `"Error: Only SELECT queries are allowed. Got: WITH"`, wastes a step recovering.
**Impact**: Burns 1-2 steps on error recovery, reducing reward. Teaches model to avoid CTEs even though they're valid read-only SQL.
**Root cause**: Hard-coded prefix check. The DB is already opened with `mode=ro`, so SQLite itself would reject writes.
**Fix**: Allow `WITH` as a valid query prefix, or remove the prefix check entirely and rely on `mode=ro`.
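A sketch of the relaxed check, leaning on the read-only connection for safety:

```python
def is_allowed_query(sql: str) -> bool:
    # Allow plain SELECT and CTE (WITH ... SELECT) queries. Writes are
    # rejected by SQLite itself because the DB is opened read-only, e.g.:
    #   sqlite3.connect("file:spider.db?mode=ro", uri=True)
    tokens = sql.lstrip().split(None, 1)
    if not tokens:
        return False
    return tokens[0].upper() in ("SELECT", "WITH")
```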
### Post-episode repetition
**Problem**: Model keeps calling tools after episode ends (gets `{'error': 'Episode is over'}`). The -0.1 penalty exists but model still does 3-5 extra calls.
**Possible fixes**: Increase penalty, or the model may learn to stop as GRPO training progresses.
### HF_SUFFIX naming bug (FIXED)
**Problem**: `HF_SUFFIX` is concatenated directly onto `grpo` without auto-prepending a dash. Setting `HF_SUFFIX="no-no-thinking"` produces `sqlenv-qwen3-1.7b-grpono-no-thinking` instead of the intended `sqlenv-qwen3-1.7b-grpo-no-no-thinking`. The `grpono-no-thinking` checkpoint on HF Hub was manually renamed via HF UI after push.
**Root cause**: Format string `f"sqlenv-{_model_short}-grpo{HF_SUFFIX}"` expects the user to include a leading dash.
**Fix**: Auto-prepend dash + strip existing prefixes from checkpoint names. When resuming from `hjerpe/sqlenv-qwen3-0.6b-grpo`, the old code produced `sqlenv-sqlenv-qwen3-0.6b-grpo-grpo-v2` (double prefix). Now strips `sqlenv-` and `-grpo*` from `_model_short` before rebuilding the name.
**Files**: `notebooks/train_grpo.ipynb` save cell.
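A sketch of the normalization (the regexes are illustrative, not the exact notebook code):

```python
import re

def build_checkpoint_name(model_short, hf_suffix):
    # Strip a stale "sqlenv-" prefix and "-grpo..." tail from a resumed
    # checkpoint name, then rebuild with an auto-dashed suffix.
    base = re.sub(r"^sqlenv-", "", model_short)
    base = re.sub(r"-grpo.*$", "", base)
    suffix = hf_suffix or ""
    if suffix and not suffix.startswith("-"):
        suffix = "-" + suffix
    return f"sqlenv-{base}-grpo{suffix}"

print(build_checkpoint_name("sqlenv-qwen3-0.6b-grpo", "v2"))
# sqlenv-qwen3-0.6b-grpo-v2
```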
### Save cell uses Phase 1 config for output_dir
**Problem**: `model.save_pretrained(config.output_dir)` uses Phase 1's `config`, not Phase 2's `config2`. Both phases write to `outputs/grpo_run` — Phase 2 overwrites Phase 1 checkpoints in the same directory.
**Impact**: Not a correctness bug (the final model weights are from Phase 2, which is correct), but fragile if you want to preserve Phase 1 checkpoint separately.
**Fix**: Use `config2.output_dir` in the save cell, or save Phase 1 to a separate directory before Phase 2 starts.
## Training Runs
### Run 1 (pre-fixes): SFT OK, GRPO plateau at ~30-40% accuracy
- Model learned tool-calling but rewards flat, advantage=0 most steps
- Identified: no penalty for post-episode, answer format issues
### Run 2 (batch 1 fixes): GRPO collapse — null args
- SFT taught `{"sql": null, "table_name": "X", "value": null}`
- Every rollout got TypeError → reward=0 → no gradient signal
- Root cause: Qwen3 tokenizer expanding dict args
### Run 3 (JSON string args fix): GRPO collapse — format drift
- SFT clean, first ~30 steps showed correct tool calls
- By step 40+: structural tokens drifted, producing malformed tool-call tags instead of well-formed `<tool_call>` output
- GRPO drifted structural tokens without KL penalty
### Run 4 (KL penalty beta=0.04): OOM
- Reference model doubled memory, exceeded L4 24GB
### Run 5 (beta=0.04, reduced tokens/generations): KL too conservative
- No collapse, correct format, but reward=0.00 everywhere
- Model only generates single describe call per rollout
- KL penalty keeps model too close to single-turn SFT policy
- All 4 rollouts identical → advantage=0 → no learning
### Run 6 (multi-turn SFT + assistant_only_loss): First successful training
- Switched SFT from per-turn (347 examples) to multi-turn (100 full trajectories)
- Enabled `assistant_only_loss` via Qwen3 template patch
- Removed arrow-notation few-shot examples from system prompt
- **Phase 1** (435 easy, beta=0.04, 512 tokens, ~2h50m):
- Clear upward reward trend: ~0.15 → 0.5-0.75
- Loss trends upward 0→0.14, showing learning from reward signal
- Model writes JOINs, GROUP BY HAVING, NOT IN subqueries, uses `sample` tool
- Recovers from SQL errors (wrong column → retry, CTE rejected → plain JOIN)
- CTE (WITH) queries rejected by environment — wasted steps
- **Phase 2** (467 easy+medium, beta=0, 1024 tokens, ~3h37m):
- Reward holds ~0.5 average, no format collapse without KL
- Peak rewards reach 0.93
- Correct answers on COUNT, AVG, GROUP BY, multi-table JOINs, subqueries
- Medium questions harder — more column-name errors, alias confusion
- Final reward: 0.64
- **Persistent issues**:
- Error loop: model repeats same failing query without changing it (step 140: "no such column: bonus" 7 times)
- Table alias confusion: `T2.column` when column is on T1
- Missing DISTINCT in COUNT queries
- Post-episode repetition: 1-3 extra calls after correct answer
- Empty `<think>` blocks — model not reasoning about errors
## Changes for Run 7
Applied after Run 6 analysis:
### 11. Allow CTE (WITH) queries
**Fix**: Changed SQL validation from `first_keyword != "SELECT"` to `first_keyword not in ("SELECT", "WITH")`.
**Files**: `server/sql_environment.py` (both `_execute_gold_sql` and `_execute_sql`)
### 12. Increase post-episode penalty
**Fix**: `_POST_EPISODE_PENALTY` from -0.1 to -0.3. The -0.1 penalty wasn't strong enough — model still made 3-5 extra calls after episode end.
**File**: `training/trl_adapter.py`
### 13. HF Hub suffix for model versioning
**Fix**: Added `HF_SUFFIX` parameter to save cell. Set to e.g. "-v2" or "-cte" to push to `hjerpe/sqlenv-qwen3-1.7b-grpo-v2`.
**File**: `notebooks/train_grpo.ipynb` cell 9
### Run 7 (repeat penalty + configure fix): Stable reward, multi-table weakness exposed
- **Date**: 2026-04-05
- **Changes**: F015 error-repetition penalty (`_REPEAT_PENALTY = -0.2`, 3-call deque window), removed public `configure()` that TRL misidentified as a tool
- **Branch**: `feat/error-repetition-penalty`
- **SFT**: 120 multi-turn trajectories, 2 epochs, loss 2.2→0.06, assistant-only loss enabled. 14% of tokens are assistant tokens. Post-SFT format check: all 3 samples produce correct `<tool_call>` JSON with `describe` as first move.
- **Phase 1** (435 easy, beta=0.04, 512 tokens, ~2h):
- Reward: −0.1 → 0.7 peak, stabilizing 0.3-0.7. Loss spike at step 320 (1.8) recovered.
- Model learned: `describe` → `query` → `answer`, comma-separated lists, pipe-delimited rows, `[]` for empty results, `UNION` queries, `NOT IN` subqueries, `LIKE '%North%'`.
- Repeat penalty observable: step 100 reward −0.22 (model re-described same table), step 120 reward −0.24 with repeat penalty stacking.
- Error recovery improved: after SQL error, model calls `describe` on the failing table then retries with correct column names (steps 110, 140).
- Persistent: hallucinated column names from pretraining (T_full_name), `ORDER BY count(*) DESC` without `GROUP BY`, CTE queries still rejected.
- **Phase 2** (467 easy+medium, beta=0.0, 1024 tokens, ~2h22m):
- Reward oscillated 0.0–1.15, no clear upward trend vs Phase 1. Mean reward ~0.5.
- Single-table questions consistently correct (count, filter, aggregate, WHERE + GROUP BY HAVING).
- Multi-table JOIN weakness: can't follow FK chains (Documents→Templates→Ref_Template_Types), joins on wrong keys, hallucinates join columns.
- Repeat penalty firing on multi-table failures: step 150 reward −0.58 (5+ repeated failed JOINs on `T2.Template_ID`).
- New behavior: model answers `[]` for genuinely empty results, learned `"No results"` → `"[]"` mapping.
- Step 80 (Phase 2): 1.15 reward, advantage +1.50 — model wrote `SELECT avg(weight), year FROM cars_data GROUP BY year` with 13-row correct answer in 2 tool calls. Peak efficiency.
- Final reward: 0.61.
- **Persistent issues**:
- Multi-table JOINs: model can't chain through intermediate tables (needs the question-to-FK-path reasoning that 1.7B lacks without thinking)
- Answer hallucination when query returns empty: submits "No data available" or "N/A" instead of trying different query
- `describe` repeat on already-described tables (penalty fires but model still does it)
- Step 430: hex-encoded query string (`0x45636365646965...`) — degenerate output near end of training
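The F015 repeat penalty used in this run can be sketched as a small exact-match window (the -0.2 value and 3-call deque are from the run notes; the class structure is illustrative):

```python
from collections import deque

_REPEAT_PENALTY = -0.2

class RepeatTracker:
    def __init__(self):
        self._recent = deque(maxlen=3)  # sliding window of the last 3 calls

    def penalty_for(self, name, arguments_json):
        # An exact-match repeat of a recent (tool, args) pair earns -0.2.
        key = (name, arguments_json)
        hit = key in self._recent
        self._recent.append(key)
        return _REPEAT_PENALTY if hit else 0.0
```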
### Run 8 (thinking mode): Thinking helps error recovery but introduces degenerate loop
- **Date**: 2026-04-06
- **Changes**: F012 `enable_thinking` config flag, `ENABLE_THINKING = True` in notebook, max_new_tokens 768 (Phase 1) / 1280 (Phase 2)
- **Branch**: `feat/enable-thinking-mode`
- **SFT**: Same 120 multi-turn trajectories as Run 7, but the system prompt omits the `/no_think` prefix. SFT data itself has no `<think>` blocks (approach B: let GRPO discover thinking).
- **Phase 1** (435 easy, beta=0.04, 768 tokens, ~4.5h):
- Loss 0.31→oscillating 0.05-0.40 throughout. No clear trend.
- Correct answers on ~50% of sampled steps (reward 1.15). Similar to Run 7 on easy questions.
- **Thinking triggers on errors**: Step 90 — after 2 SQL errors (`no such column: airport_code`), model opens `<think>`, reasons about the column name mismatch, then generates the correct `AirportCode` query. Step 180 — reasons about `course_title` vs `course_name` after an error, corrects to the right column.
- **Empty think blocks for easy questions**: Steps 20-80 all show empty `<think></think>` blocks — model skips thinking when confident. Good token efficiency.
- **NEW failure mode: `assistant` degenerate loop** — ~10/43 sampled steps (23%) show `assistantassistant...` repeating until the token limit. Model fails to close `<think>` and enters a repetitive pattern. Steps 110, 140, 200, 260, 300, 340, 410, 420, 430 all exhibit this. Burns the entire token budget with no useful output.
- Multi-table JOINs with subqueries work (Step 30: `NOT IN` subquery, Step 80: UNION, Step 435: correlated subquery with HAVING).
- Final step 435: model writes complex correlated subquery with `HAVING count(*) = (SELECT ... ORDER BY count(*) DESC LIMIT 1)` — correct answer "Martin".
- **Phase 2** (467 easy+medium, beta=0.0, 1280 tokens, stopped at step 182/467 — likely OOM):
- Reward oscillated 0.1-0.85, averaging ~0.45. Comparable to Run 7 Phase 2 (~0.5).
- Step 10: Easy question solved in 3 tool calls (describe→query→answer). Reward 1.15.
- Step 90: Multi-table JOIN with `HAVING count(*) < 200` — correct, reward 1.15.
- Step 110: `NOT IN` subquery for stadiums without concerts — correct on first try.
- Step 140: Cross-table JOIN (evaluation + employee, `MAX(bonus)`) — correct.
- Step 150: Multi-table chain reasoning with thinking — corrected the `Document_Name` → `Template_ID` join path after 2 errors. Long `<think>` block with correct reasoning.
- Step 170: Double-year intersection query (`Stadium_ID IN ... 2014 AND Stadium_ID IN ... 2015`) — correct.
- **Crashed at step 182** — likely OOM from 1280 max_new_tokens + thinking blocks consuming more memory during generation.
- Model checkpoint was NOT pushed to HF Hub before crash.
- **Persistent issues**:
- `assistant` degenerate loop (~23% of Phase 1 steps) — new failure mode unique to thinking mode
- Multi-table FK chain queries still fail on medium difficulty (same as Run 7)
- Phase 2 no better than Run 7's Phase 2 — thinking mode doesn't help with the fundamental JOIN reasoning gap
### Run 9 (v2 continued training, no-think): Confirms Phase 2 ceiling
- **Date**: 2026-04-11
- **Changes**: Resumed from v1 checkpoint (Run 7's final weights), 2 epochs Phase 1 + 2 epochs Phase 2. Fixed model preset lookup (`_get_preset()` matching on "1.7b" in name string instead of exact `.get()`).
- **Branch**: `feat/f011-3-way-comparison-notebook`
- **Phase 1** (435 easy, beta=0.04, 512 tokens, ~3h34m, 870 steps):
- Loss: oscillates 0.01-0.13, occasional negatives (-0.05) in second half. More negative values than v1 Phase 1 — expected since starting from trained checkpoint, less to learn.
- Rewards: sawtooth 0.01-1.15. Easy questions solved reliably (describe→query→answer in 3 calls). Medium questions from mixed batches still fail.
- Model behavior: solid tool-call format, comma-separated lists, pipe-delimited rows. No format collapse.
- Step 300: Degenerate SQL — `ORDER BY HorsepowerDESC` (missing space), repeated 3 times. Token budget consumed.
- Step 560: Degenerate completion — output "icher Consulting Solution" (truncated gibberish). Reward 0.00. One-off.
- **Phase 2** (467 easy+medium, beta=0.0, 1024 tokens, ~3h50m, 934 steps):
- Loss: oscillates -0.13 to +0.12, trend more negative than Phase 1 — policy sharpens on known patterns without KL regularization.
- Rewards: same sawtooth 0.01-1.15 as Phase 1, no upward trend. Mean ~0.5.
- **Successes (medium)**: Step 140 — JOINed evaluation→employee for MAX(bonus), found "Louis Deacon" (1.13 reward). Step 750 — subquery `COUNT(*) > (SELECT ... ORDER BY Horsepower DESC LIMIT 1)`, answered "39" correctly.
- **Failures (medium)**: Step 20 — hallucinated `make_id`, `full_name` columns, budget exhausted after 8+ tool calls. Step 50 — invented `Course_Attendance` table, cascading errors. Step 530 — tried `Bred`, `Breed` before finding `Breeds`, then queried wrong column.
- **Persistent pattern**: Model describes tables correctly but writes SQL with wrong column names from pretraining knowledge (e.g., `full_name` instead of `FullName`, `country.name` when table is `singer` with `Country` column).
- Final reward: 0.048 (last step was incorrect)
- **Charts**: Reward Trend (Phase 1→2) shows flat continuation — no improvement from adding medium questions. Loss in Phase 2 oscillates around 0, with spikes to -0.13 (GRPO reinforcing already-known easy patterns).
- **Conclusion**: v2 confirms v1 findings. The 0.6B model's accuracy ceiling is set by pretraining SQL knowledge, not RL training budget. More epochs don't help medium questions. Next interventions: (1) more SFT on multi-table JOINs with correct column names, (2) larger model (1.7B), or (3) increase step budget to let model iterate.
### Eval Format Fix (F011 comparison notebook)
- **Date**: 2026-04-10
- **Problem**: `compare_methods.ipynb` eval fed models a different message format than TRL training:
1. Tool results posted as `role: "user"` — training uses `role: "tool"` (Qwen3 renders it inside a `<tool_response>` wrapper)
2. Assistant turns stored as raw text content — training uses structured `tool_calls` dicts with JSON-string arguments
3. Question + table hint separated by `\n\n` — TRL appends `reset()` return directly to user message (no separator)
- **Discovery method**: Added a debug cell to render prompts via `apply_chat_template` and compared side-by-side with TRL training log output. The `role: "tool"` format renders as `<|im_start|>user\n<tool_response>...` while `role: "user"` renders as `<|im_start|>user\nplain text` — structurally different despite both appearing under the `user` token.
- **Fix**: Changed `LLMToolCallingPolicy` in compare_methods.ipynb to match TRL exactly: structured `tool_calls`, `role: "tool"`, concatenated user message. Also parse ALL `<tool_call>` blocks per generation and buffer extras (matches TRL's `_tool_call_loop`).
- **Result (N=50, base=Qwen3-0.6B, 2026-04-11, with parse-failure retry, 2 runs)**:
- **Run A:**
- zero-shot: 0% accuracy, 28% parse rate, avg 10.8 steps (31/50 budget exhaust)
- 1-shot: 0% accuracy, 16% parse rate, avg 14.8 steps (49/50 budget exhaust)
- 3-shot: 0% accuracy, 20% parse rate, avg 13.8 steps (44/50 budget exhaust)
- grpo-v1: 28% accuracy, 95% parse rate, avg 4.0 steps, avg reward 0.355
- grpo-v2: 32% accuracy, 87% parse rate, avg 3.7 steps, avg reward 0.400
- **Run B (same day, different Colab session):**
- zero-shot: 0% accuracy, 24% parse rate, avg 12.4 steps (38/50 budget exhaust)
- 1-shot: 2% accuracy, 17% parse rate, avg 14.0 steps (46/50 budget exhaust)
- 3-shot: 0% accuracy, 19% parse rate, avg 14.8 steps (49/50 budget exhaust)
- grpo-v1: 30% accuracy, 100% parse rate, avg 3.5 steps, avg reward 0.386
- grpo-v2: 24% accuracy, 95% parse rate, avg 3.6 steps, avg reward 0.321
- **Run-to-run variation**: v1 scored 28% then 30%, v2 scored 32% then 24%. The ~6-8pp swing confirms v1 and v2 are statistically indistinguishable at N=50. Report as "~30% accuracy" for both.
- Parse failure retry: base models no longer die on first parse failure — they get a no-op DESCRIBE and continue. This reveals they waste their entire 15-step budget repeating the same malformed output.
- Base model failure mode: can't produce the `<tool_call>` format (76-83% parse failure rate). GRPO failure mode: produces valid tool calls but writes wrong SQL.
- 1-shot scored 2% in Run B (1 lucky episode) — demonstrates N=50 noise floor for rare events.
- **Checkpoint naming**: `grpono-no-thinking` was caused by `HF_SUFFIX="no-no-thinking"` (missing leading dash) and subsequent HF UI rename. See "Discovered Issues" section.
- **TRL format verified from source**: `reset()` return is appended to the last user message (TRL docs + grpo_trainer.py). Tool results use `{"role": "tool", "name": name, "content": result}`. Generation runs to EOS (no stop at `</tool_call>`), all parsed tool calls executed in sequence.
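The TRL-matching message shapes can be sketched as follows (the question, hint, and tool result are hypothetical; the `tool_calls` structure follows common HF chat conventions and should be treated as illustrative):

```python
import json

question = "How many singers are there?"  # hypothetical episode
table_hint = "Tables: singer"

messages = [
    # Question + hint concatenated directly, with no "\n\n" separator.
    {"role": "user", "content": question + table_hint},
    # Assistant turn: structured tool_calls with JSON-string arguments,
    # not raw tool-call text in content.
    {"role": "assistant", "content": "", "tool_calls": [
        {"type": "function", "function": {
            "name": "describe",
            "arguments": json.dumps({"table_name": "singer"}),
        }},
    ]},
    # Environment result posted as role "tool", not role "user".
    {"role": "tool", "name": "describe", "content": "singer(Singer_ID, Name)"},
]
```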
## Current Status (after Run 9)
### Working:
- Multi-turn SFT + `assistant_only_loss` — still the critical foundation
- GRPO learns on easy questions: reward −0.1→0.7 in Phase 1 (both Runs 7 and 8)
- Repeat penalty (F015) fires correctly on exact-match repeated calls
- Error recovery: describe→retry after SQL error is a learned behavior
- Answer format: single values, comma-separated lists, pipe-delimited rows, `[]` for empty
- **Thinking mode triggers on errors** — model reasons about column name mismatches and table structure after SQL errors (Steps 90, 150, 180, 220, 280 in Run 8)
- **Empty think blocks for easy questions** — model doesn't waste tokens thinking when confident
### Not yet working:
- Multi-table FK chain queries (medium difficulty) — confirmed across Runs 7, 8, 9. More RL epochs don't help.
- Phase 2 shows no improvement over Phase 1 — medium questions need more SFT coverage on JOIN patterns
- Column name hallucination from pretraining — model reads schema correctly then writes pretrained column names
- Model doesn't use `sample` tool (learned in Run 6 but lost?)
- **`assistant` degenerate loop** — thinking mode (Run 8) introduces ~23% failure rate from unclosed think tags
### For comparison notebook (F011):
- **v1 checkpoint** on HF Hub: `hjerpe/sqlenv-qwen3-0.6b-grpo`
- **v2 checkpoint** on HF Hub: `hjerpe/sqlenv-qwen3-0.6b-grpo-v2`
- **Run 8 (thinking)** checkpoint was NOT pushed — Colab session crashed before save
- N=50 eval completed 2026-04-11 (2 runs): v1 ~28-30%, v2 ~24-32%, confirming both are ~30% and within run-to-run noise
- v1 and v2 are statistically indistinguishable — the difference between runs is larger than the difference between checkpoints
- Thinking mode comparison can be added later when a checkpoint is available
### Possible next interventions:
- **Thinking mode training (0.6B)**: Resume from v1 with `ENABLE_THINKING=True`, push as `-think` suffix. Run 8 showed thinking helps error recovery but crashed before save.
- **More SFT on multi-table JOINs**: Add trajectories with 3+ table chains, correct column names after describe. Highest priority — v2 proved more RL epochs don't help without this.
- **Increase model size**: Switch from 0.6B to 1.7B. Larger model may override pretrained column name biases from schema context.
### OOM prevention for next thinking-mode run:
The Run 8 Phase 2 crash at step 182/467 was likely OOM. Root causes and mitigations:
1. **`max_new_tokens=1280` is too high for L4 with thinking** — medium questions trigger long `<think>` blocks (Step 50 reasoning about `>1` vs `>=1`, Step 120 about breed/size format, Step 130 about `T1.distinct_city` column mismatch). Reduce to **1024** for Phase 2.
2. **`num_generations=4` compounds the problem** — each generation runs inference independently, so 4 rollouts × 1280 tokens = 5120 tokens of peak generation memory. Reduce to **3 generations** for thinking-mode Phase 2. The `generation_batch_size` must also be updated to match.
3. **The `assistant` degenerate loop inflates effective token usage** — a rollout that enters the loop consumes the full `max_new_tokens` budget producing garbage. Fixing this loop via SFT (adding 5-10 examples with proper `reasoning` blocks) would reduce average token consumption significantly, making OOM less likely even at higher token limits.
4. **Phase 2 has no KL reference model (beta=0)** — so memory is only model + generation buffers. The OOM is purely from generation length, not model copies.
**Recommended config for next thinking-mode run (Phase 2):**
```python
config2 = replace(config,
beta=0.0,
max_new_tokens=1024, # was 1280
num_generations=3, # was 4
enable_thinking=True,
)
```
Also set `generation_batch_size=3` in `notebook_pipeline.py` (it must equal `num_generations`).
## Historical: Status after Run 6
### Architecture decisions to preserve:
- Multi-turn SFT with `assistant_only_loss` — critical over per-turn
- Qwen3 template patch (`{% generation %}` tags) for SFT, restore original before GRPO
- SFT args as JSON strings (not dicts) — critical for Qwen3
- Phase 1 (easy, KL) → Phase 2 (easy+medium, no KL)
- DB opened with `mode=ro` — safety enforced by SQLite, not regex
## File Map
| File | What changed |
|------|-------------|
| `scripts/generate_sft_data.py` | Multi-turn trajectories, JSON string args, answer formatting |
| `scripts/inspect_sft_data.py` | SFT data stats + tokenizer-rendered inspection |
| `training/trl_adapter.py` | Post-episode penalty (-0.3), error surfacing, `_result_or_error` |
| `training/config.py` | Added beta field (KL penalty) |
| `training/notebook_pipeline.py` | generation_batch_size, beta passthrough |
| `server/verifier.py` | `_strip_answer_wrapping` preprocessing |
| `server/sql_environment.py` | SQL validation allows SELECT and WITH |
| `notebooks/train_grpo.ipynb` | Multi-turn SFT, assistant_only_loss, template patch/restore, HF_SUFFIX |
## Key Learnings
1. **Qwen3's apply_chat_template expands dict args** — always use JSON strings for SFT tool_call arguments.
2. **Multi-turn SFT is critical for agentic GRPO** — per-turn examples teach one action; the model never learns the full workflow. Full trajectory SFT with `assistant_only_loss` teaches describe→query→answer as a coherent strategy.
3. **Qwen3 template lacks {% generation %} tags** — patch before SFT for `assistant_only_loss`, restore before GRPO. TRL's `add_response_schema()` and `get_training_chat_template()` do exact string equality on the template.
4. **Don't show competing formats to small models** — arrow-notation few-shot examples confused the model when it needed to produce `<tool_call>` JSON.
5. **KL penalty effectiveness depends on SFT quality** — beta=0.04 was "too high" only because the SFT policy was single-turn. With multi-turn SFT, the same beta works fine.
6. **Reference model doubles memory** — plan for this when using KL penalty on L4.
7. **Let the SQL engine enforce safety, not regex** — hard-coded `SELECT`-only prefix check blocks valid read-only SQL (CTEs). The DB is already `mode=ro`.
8. **Render training data through the actual tokenizer** — inspect scripts that reformat JSON are fragile. The ground truth is `apply_chat_template` output from the same tokenizer instance used for training.
9. **Error loops are a 1.7B capacity limit** — the model repeats failing queries verbatim because `<think>` is suppressed and it can't reason about the error. Enabling thinking mode may help.
10. **Post-episode penalty of -0.1 is too weak** — model still makes 3-5 extra calls. Increased to -0.3.
11. **Repeat penalty works but doesn't fix root cause** — the −0.2 penalty fires correctly on exact-match repeated tool calls, but the model's real problem is pretrained column-name hallucination, not repetition per se. The model varies its queries enough to avoid exact repeats while still failing on the same conceptual error.
12. **Phase 2 (medium) doesn't improve over Phase 1 (easy)** — reward plateau at ~0.5 suggests the model needs more SFT coverage on multi-table JOINs, not just more GRPO steps. RL can't teach FK chain reasoning that isn't in the initial policy.
13. **Thinking mode helps error recovery but doesn't improve overall accuracy** — the model uses `<think>` blocks to reason about SQL errors (column name mismatches, table structure), leading to correct retries. But accuracy on easy questions is similar to no-think Run 7. The benefit is qualitative (better error recovery), not quantitative (higher reward).
14. **`assistant` degenerate loop is a new failure mode** — ~23% of thinking-mode steps degenerate into `assistantassistant...` repeating until the token limit. The model fails to produce the closing `</think>` tag and enters a repetitive pattern. This is the thinking-mode equivalent of Run 7's post-episode repetition. Fix: add SFT examples with proper `reasoning` blocks.
15. **Empty `<think>` blocks are good** — the model learns to skip thinking on easy questions, preserving tokens for tool calls. This is emergent behavior from the GRPO reward signal (thinking wastes tokens → lower reward on easy questions).
16. **1280 max_new_tokens is too aggressive for thinking mode on L4** — Phase 2 crashed at step 182/467, likely OOM. The longer `` blocks in Phase 2 (medium questions trigger more reasoning) push memory past L4's 24GB. Use 1024 max_new_tokens for thinking-mode Phase 2.
17. **Public methods on environment_factory become TRL tools** — TRL introspects all public methods for JSON schema generation. The `configure()` classmethod caused a `DocstringParsingException`. Keep configuration methods private (`_configure`).
18. **Continued training from checkpoint doesn't unlock medium questions** — v2 ran 2 more epochs of Phase 1 + Phase 2 from v1's final checkpoint. Reward stayed flat at ~0.5 mean. The model reliably solves easy single-table queries but can't learn multi-table FK chain reasoning from RL alone. The policy needs SFT coverage on the patterns it can't discover through trial-and-error.
19. **Column name hallucination is the dominant error mode** — the model describes tables correctly (seeing `FullName: TEXT`) then writes `SELECT full_name` or `SELECT Maker, FullName FROM car_makers ORDER BY MakerDESC LIMIT 1` (missing space). This is pretrained SQLese overriding the schema information the model just read. A 0.6B model can't override pretraining biases through RL reward signal alone.
20. **Eval must exactly match TRL's message format** — `role:"tool"` for env results (not `role:"user"`), structured `tool_calls` dicts for assistant turns (not raw `` text in content), question+table_hint concatenated without separator (TRL appends `reset()` return to last user message). Qwen3 renders `role:"tool"` as `<|im_start|>user\n...` — looks like a user message but is structurally different. Getting this wrong caused 0% accuracy across all conditions; fixing it recovered 10-50% on base model.
21. **Incorrect answer reward of 0.0 creates an avoid-answering incentive** — exploration steps accumulate ~0.05-0.15 reward. Calling `answer(wrong)` gives 0.0 and ends the episode, so total reward (~0.05) can be lower than not answering and exploring until budget (~0.10). The model may learn to write prose instead of calling `answer()` when uncertain. PRS (Progressive Reward Shaping, arxiv 2512.07478) addresses this with a small format-compliance reward for completing the tool pipeline regardless of correctness.
22. **Continued training trades guessing for abstention** — v2 outputs "Task complete." instead of calling `answer()` on hard questions — a form of calibrated uncertainty. v1 guesses more but gets fewer right per attempt. The 0.0 incorrect-answer reward (learning #21) drives this: v2 internalized that guessing wrong is worse than not answering.
23. **v1 and v2 are statistically indistinguishable at N=50** — across two runs, v1 scored 28% then 30%, v2 scored 32% then 24%. The ~6-8pp run-to-run variation exceeds the checkpoint difference. v2's abstention behavior (learning #22) adds variance: on borderline questions, whether v2 guesses or outputs "Task complete." varies by run. For reporting, use "~30% accuracy" for both checkpoints. N=200+ would be needed to detect a real 4pp difference with 80% power.