---
title: "Plan: Break GRPO Training Plateau"
description: Ordered interventions to break the GRPO reward plateau on Qwen3-1.7B, including curriculum, post-episode penalties, and few-shot guided rollouts
doc_type: exploration
---

# Plan: Break GRPO Training Plateau

## Context

GRPO training on Qwen3-1.7B completed 1 epoch (233 steps, 2.5h on L4), reaching ~30-40% accuracy on easy/medium Spider questions. The model learned multi-turn tool calling (describe→query→answer) and generates real SQL, but rewards show no upward trend and loss oscillates near zero.

Key problems: the model doesn't stop after answering (wasting steps), low within-group variance (advantage = 0), SQL quality errors, and answer format mismatches.

## Interventions (ordered by increasing effort)

### 1. More epochs: 1→3 + num_generations: 4→6

**Effort: trivial | Value: high | Files: notebook only**

Change in `notebooks/train_grpo.ipynb` config cell (`978bc98f`):

```python
"Qwen/Qwen3-1.7B": dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_generations=6,  # was 4
    gradient_checkpointing=True,
),
...
num_train_epochs=3,  # was 1
```

**Why**: With 1 epoch, each question is seen once. 3 epochs with 6 generations = 18 rollouts per question, and more variance within rollout groups: P(mixed outcomes) goes from 0.81 to 0.92 at 35% accuracy. Memory-safe, since generations run in inference mode.

**OOM fallback**: reduce `max_new_tokens` from 512 to 384.

### 2. Penalize post-episode actions

**Effort: low | Value: medium | File: `training/trl_adapter.py`**

The model keeps calling tools after "Episode is over", wasting 5+ steps per episode with zero learning signal.
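The within-group variance claim in step 1 can be reproduced with a short calculation, assuming each rollout's success is an independent Bernoulli trial at 35% accuracy (a simplification; real rollouts on the same question are correlated):

```python
def p_mixed(p: float, n: int) -> float:
    """Probability that a group of n independent rollouts at accuracy p
    contains at least one success and one failure, i.e. has nonzero
    within-group variance and hence a nonzero GRPO advantage."""
    return 1 - p**n - (1 - p)**n

print(round(p_mixed(0.35, 4), 2))  # 4 generations -> 0.81
print(round(p_mixed(0.35, 6), 2))  # 6 generations -> 0.92
```

Groups where every rollout succeeds or every rollout fails contribute no gradient, so raising `num_generations` directly raises the fraction of useful groups.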
Add a penalty constant:

```python
_POST_EPISODE_PENALTY = -0.1
```

In each of the 4 tool methods (describe/sample/query/answer), before the existing `raise ValueError`:

```python
if self._done:
    self.reward += _POST_EPISODE_PENALTY  # NEW
    raise ValueError("Episode is over")
```

**Why**: A rollout that answers correctly AND stops (reward 1.15) gets a higher advantage than one that answers correctly but wastes 5 steps (reward 1.15 - 0.5 = 0.65). This creates a gradient signal for "stop" behavior.

### 3. Relaxed answer matching

**Effort: low | Value: medium | File: `server/verifier.py`**

Add pre-processing in `verify_answer` to strip common LLM wrapping before type dispatch:

- Surrounding quotes: `"42"` → `42`
- Markdown code blocks: `` ```42``` `` → `42`
- "Answer: " prefix
- Extra whitespace

The existing type-aware verifier (integer coercion, float tolerance, list parsing) is already good; this catches the remaining format artifacts from the model.

### 4. Curriculum: easy first, then mixed

**Effort: medium | Value: high | Files: notebook, `training/config.py`, `training/data_loading.py`**

Split training into phases in the notebook:

```python
curriculum = [
    {"epochs": 1, "difficulties": ["easy"]},            # 435 questions
    {"epochs": 2, "difficulties": ["easy", "medium"]},  # 467 questions
]
```

Loop over the phases, calling `build_trainer` each time with the filtered prompts. The model carries forward between phases. No changes to `build_trainer` or `notebook_pipeline.py`.

**Why**: A 1.7B model at 30-40% accuracy needs to master easy fundamentals before medium. Phase 1 focuses on the 435 easy questions, where the success rate is higher, giving better reward variance and actual learning. Phase 2 adds 32 medium questions.

### 5. Few-shot guided rollouts (ToolExpander approach)

**Effort: high | Value: high | Files: new `training/few_shot_examples.py`, notebook, callback**

Include 1-2 curated example trajectories in the system prompt during early training. Progressively remove them after epoch 1 via a custom callback.
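A minimal sketch of that removal schedule; this is a simplified stand-in, not the final `FewShotSchedulerCallback` (the real version would subclass `transformers.TrainerCallback` and read the epoch from trainer state, and `build_system_prompt` is a hypothetical helper):

```python
class FewShotScheduler:
    """Simplified stand-in: appends few-shot example trajectories to the
    system prompt, then drops them once a cutoff epoch is reached."""

    def __init__(self, examples: list[str], remove_after_epoch: int = 1):
        self.examples = list(examples)
        self.remove_after_epoch = remove_after_epoch

    def build_system_prompt(self, base_prompt: str) -> str:
        # While examples remain, they guide the rollouts.
        if self.examples:
            return base_prompt + "\n\nExample trajectories:\n" + "\n".join(self.examples)
        return base_prompt

    def on_epoch_end(self, epoch: int) -> None:
        # Mirrors TrainerCallback.on_epoch_end: clear examples after the cutoff.
        if epoch >= self.remove_after_epoch:
            self.examples = []
```

In the real callback the prompt builder would be shared with the rollout code, so clearing the examples takes effect on the next generation batch.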
Based on ToolExpander (arXiv:2510.07737): few-shot guidance during RL eliminates training collapse for 1.5B models and reduces hard samples by 15-20%.

Requires:

- Curating 2-3 example trajectories with real environment output
- Modifying system prompt construction in the notebook
- Adding a `FewShotSchedulerCallback` that removes the examples after epoch 1

## Also fix: HF Hub push error

In the `notebooks/train_grpo.ipynb` save cell: already fixed. Use `model.push_to_hub()` instead of `trainer.push_to_hub()`, which is incompatible with transformers>=5.x from main.

## Implementation sequence

| Step | What | Deploy together? |
|------|------|------------------|
| 1 | epochs=3, num_generations=6 | Yes, batch 1 |
| 2 | Post-episode penalty | Yes, batch 1 |
| 3 | Answer format relaxation | Yes, batch 1 |
| — | **Run training, evaluate** | |
| 4 | Curriculum (if still plateauing) | Batch 2 |
| 5 | Few-shot guided rollouts | Batch 3 |

**Recommended**: Deploy steps 1-3 together as one commit, run training, evaluate. If the plateau breaks, stop. If not, proceed to step 4.

## Verification

1. **After steps 1-3**: Run training on Colab L4. Check:
   - No OOM with num_generations=6
   - Fewer "Episode is over" loops in completion samples
   - Reward trend shows an upward slope (not flat)
   - Loss magnitude increases (more gradient signal)
2. **After step 4**: Check that the easy-only phase reaches >50% accuracy before medium questions are added
3. **After step 5**: Check that SQL quality improves on hard questions

## Critical files

| File | Changes |
|------|---------|
| `notebooks/train_grpo.ipynb` | Config: epochs, generations. Curriculum loop (step 4). Few-shot prompt (step 5). |
| `training/trl_adapter.py` | Post-episode penalty in 4 tool methods (lines 239, 259, 281, 300) |
| `server/verifier.py` | `_strip_answer_wrapping()` preprocessing |
| `training/config.py` | Curriculum fields (step 4 only) |
| `training/data_loading.py` | Phase-aware loading (step 4 only) |
| `training/few_shot_examples.py` | New file with curated trajectories (step 5 only) |
| `docs/exploration/rl-vs-icl-research.md` | Already created; reference for F011 |