---
title: "Plan: Break GRPO Training Plateau"
description: Ordered interventions to break the GRPO reward plateau on Qwen3-1.7B, including curriculum, post-episode penalties, and few-shot guided rollouts
doc_type: exploration
---

# Plan: Break GRPO Training Plateau

## Context

GRPO training on Qwen3-1.7B completed 1 epoch (233 steps, 2.5h on L4), reaching ~30-40% accuracy on easy/medium Spider questions. The model learned multi-turn tool calling (describe→query→answer) and generates real SQL, but rewards show no upward trend and loss oscillates near zero.

Key problems: the model doesn't stop after answering (wasting steps), low within-group variance (advantage = 0), SQL quality errors, and answer format mismatches.

## Interventions (ordered by increasing effort)

### 1. More epochs: 1→3 + num_generations: 4→6

**Effort: trivial | Value: high | Files: notebook only**

Change in `notebooks/train_grpo.ipynb` config cell (`978bc98f`):

```python
"Qwen/Qwen3-1.7B": dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_generations=6,  # was 4
    gradient_checkpointing=True,
),
...
num_train_epochs=3,  # was 1
```

**Why**: With 1 epoch, each question is seen once. 3 epochs with 6 generations = 18 rollouts per question, and more variance within rollout groups: P(mixed outcomes) goes from 0.81 to 0.92 at 35% accuracy. Memory-safe, since generations run in inference mode.

**OOM fallback**: reduce `max_new_tokens` from 512 to 384.

### 2. Penalize post-episode actions

**Effort: low | Value: medium | File: `training/trl_adapter.py`**

The model keeps calling tools after "Episode is over", wasting 5+ steps per episode with zero learning signal.
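The within-group variance claim in step 1 can be reproduced with a short calculation, assuming each rollout's success is an independent Bernoulli trial at 35% accuracy (a simplification; real rollouts on the same question are correlated):

```python
def p_mixed(p: float, n: int) -> float:
    """Probability that a group of n independent rollouts at accuracy p
    contains at least one success and one failure, i.e. has nonzero
    within-group variance and hence a nonzero GRPO advantage."""
    return 1 - p**n - (1 - p)**n

print(round(p_mixed(0.35, 4), 2))  # 4 generations -> 0.81
print(round(p_mixed(0.35, 6), 2))  # 6 generations -> 0.92
```

Groups where every rollout succeeds or every rollout fails contribute no gradient, so raising `num_generations` directly raises the fraction of useful groups.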
Add a penalty constant:

```python
_POST_EPISODE_PENALTY = -0.1
```

In each of the 4 tool methods (describe/sample/query/answer), before the existing `raise ValueError`:

```python
if self._done:
    self.reward += _POST_EPISODE_PENALTY  # NEW
    raise ValueError("Episode is over")
```

**Why**: A rollout that answers correctly AND stops (reward 1.15) gets a higher advantage than one that answers correctly but wastes 5 steps (reward 1.15 - 0.5 = 0.65). This creates a gradient signal for "stop" behavior.

### 3. Relaxed answer matching

**Effort: low | Value: medium | File: `server/verifier.py`**

Add pre-processing in `verify_answer` to strip common LLM wrapping before type dispatch:

- Surrounding quotes: `"42"` → `42`
- Markdown code blocks: `` ```42``` `` → `42`
- "Answer: " prefix
- Extra whitespace

The existing type-aware verifier (integer coercion, float tolerance, list parsing) is already good; this catches the remaining format artifacts from the model.

### 4. Curriculum: easy first, then mixed

**Effort: medium | Value: high | Files: notebook, `training/config.py`, `training/data_loading.py`**

Split training into phases in the notebook:

```python
curriculum = [
    {"epochs": 1, "difficulties": ["easy"]},            # 435 questions
    {"epochs": 2, "difficulties": ["easy", "medium"]},  # 467 questions
]
```

Loop over the phases, calling `build_trainer` each time with the filtered prompts. The model carries forward between phases. No changes to `build_trainer` or `notebook_pipeline.py`.

**Why**: A 1.7B model at 30-40% accuracy needs to master easy fundamentals before medium. Phase 1 focuses on the 435 easy questions, where the success rate is higher, giving better reward variance and actual learning. Phase 2 adds 32 medium questions.

### 5. Few-shot guided rollouts (ToolExpander approach)

**Effort: high | Value: high | Files: new `training/few_shot_examples.py`, notebook, callback**

Include 1-2 curated example trajectories in the system prompt during early training. Progressively remove them after epoch 1 via a custom callback.
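A minimal sketch of that removal schedule; this is a simplified stand-in, not the final `FewShotSchedulerCallback` (the real version would subclass `transformers.TrainerCallback` and read the epoch from trainer state, and `build_system_prompt` is a hypothetical helper):

```python
class FewShotScheduler:
    """Simplified stand-in: appends few-shot example trajectories to the
    system prompt, then drops them once a cutoff epoch is reached."""

    def __init__(self, examples: list[str], remove_after_epoch: int = 1):
        self.examples = list(examples)
        self.remove_after_epoch = remove_after_epoch

    def build_system_prompt(self, base_prompt: str) -> str:
        # While examples remain, they guide the rollouts.
        if self.examples:
            return base_prompt + "\n\nExample trajectories:\n" + "\n".join(self.examples)
        return base_prompt

    def on_epoch_end(self, epoch: int) -> None:
        # Mirrors TrainerCallback.on_epoch_end: clear examples after the cutoff.
        if epoch >= self.remove_after_epoch:
            self.examples = []
```

In the real callback the prompt builder would be shared with the rollout code, so clearing the examples takes effect on the next generation batch.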
Based on ToolExpander (arXiv:2510.07737): few-shot guidance during RL eliminates training collapse for 1.5B models and reduces hard samples by 15-20%.

Requires:

- Curating 2-3 example trajectories with real environment output
- Modifying system prompt construction in the notebook
- Adding a `FewShotSchedulerCallback` that removes the examples after epoch 1

## Also fix: HF Hub push error

In the `notebooks/train_grpo.ipynb` save cell: already fixed. Use `model.push_to_hub()` instead of `trainer.push_to_hub()`, which is incompatible with transformers>=5.x from main.

## Implementation sequence

| Step | What | Deploy together? |
|------|------|------------------|
| 1 | epochs=3, num_generations=6 | Yes, batch 1 |
| 2 | Post-episode penalty | Yes, batch 1 |
| 3 | Answer format relaxation | Yes, batch 1 |
| — | **Run training, evaluate** | |
| 4 | Curriculum (if still plateauing) | Batch 2 |
| 5 | Few-shot guided rollouts | Batch 3 |

**Recommended**: Deploy steps 1-3 together as one commit, run training, evaluate. If the plateau breaks, stop. If not, proceed to step 4.

## Verification

1. **After steps 1-3**: Run training on Colab L4. Check:
   - No OOM with num_generations=6
   - Fewer "Episode is over" loops in completion samples
   - Reward trend shows an upward slope (not flat)
   - Loss magnitude increases (more gradient signal)
2. **After step 4**: Check that the easy-only phase reaches >50% accuracy before medium questions are added
3. **After step 5**: Check that SQL quality improves on hard questions

## Critical files

| File | Changes |
|------|---------|
| `notebooks/train_grpo.ipynb` | Config: epochs, generations. Curriculum loop (step 4). Few-shot prompt (step 5). |
| `training/trl_adapter.py` | Post-episode penalty in 4 tool methods (lines 239, 259, 281, 300) |
| `server/verifier.py` | `_strip_answer_wrapping()` preprocessing |
| `training/config.py` | Curriculum fields (step 4 only) |
| `training/data_loading.py` | Phase-aware loading (step 4 only) |
| `training/few_shot_examples.py` | New file with curated trajectories (step 5 only) |
| `docs/exploration/rl-vs-icl-research.md` | Already created; reference for F011 |