---
title: RL vs In-Context Learning for Small Model SQL Agents
description: Research synthesis on when GRPO training adds value over pure prompting with in-context learning for sub-2B parameter models
doc_type: exploration
---

# RL vs In-Context Learning for Small Model SQL Agents

Exploration doc for F011 (Prompting Baseline Notebook).

## Context

We train Qwen3-0.6B/1.7B to explore SQL databases using multi-turn tool calls (describe, query, answer). This doc synthesizes research on when GRPO training adds value over pure prompting with in-context learning.

## Key Finding: RL wins for small models on multi-turn tool use

For sub-2B models on multi-step tasks, the evidence strongly favors RL over pure ICL. But ICL can be a strong baseline — and a hybrid (ICL during RL rollouts) may be optimal.

## When RL (GRPO) beats ICL

### 1. Small models are weak in-context learners

- Fine-tuning advantage over ICL **grows** as model size **shrinks** (NeurIPS 2022: "Few-Shot PEFT is Better and Cheaper than ICL")
- Sub-2B models lack pre-training breadth for reliable few-shot extraction
- Context window is precious — few-shot examples consume tokens needed for schema descriptions and conversation history

### 2. Multi-turn exploration needs adaptive behavior

- RL teaches error recovery, retry strategies, exploration planning
- Static few-shot examples can't teach "if your query fails, try a different approach"
- ToolRL: GRPO gives +17% over base, +15% over SFT on BFCL V3

### 3. Per-query economics favor RL at deployment

| Factor         | ICL                 | RL-trained           |
|----------------|---------------------|----------------------|
| Training cost  | Zero                | One-time (~2h on L4) |
| Per-query cost | High (long prompts) | Low (internalized)   |
| Latency        | Higher              | Lower                |

## When ICL is sufficient

- Large models (7B+) on simple/single-turn tool calls
- Prototyping before committing to training infrastructure
- Constrained output space (few tools, fixed schemas)
- When the task is already within the model's pre-training distribution

## The Hybrid Approach (most promising for our case)

### ToolExpander (arXiv:2510.07737)

- Pure GRPO on 1.5B models is unstable and often collapses mid-training
- Fix: **few-shot guided rollouts during RL** — dynamically substitute hard samples with few-shot demonstrations
- Eliminated training collapse, reduced hard samples by 15-20%

### ICRL (arXiv:2603.08068)

- Use few-shot prompts **during RL rollouts** but progressively reduce them via curriculum learning
- Transition from few-shot to zero-shot over training
- Eliminates need for SFT entirely
- Achieved SOTA on QA and math reasoning

### Implication for SQLEnv

- Start GRPO with 1-2 few-shot examples in the prompt
- As training progresses, remove examples (curriculum)
- The model internalizes the ICL patterns via the RL reward signal

## Context Window Considerations

### Qwen3-0.6B/1.7B context limits

- **TODO**: Measure effective context window and performance degradation
- Need to determine: how many few-shot examples fit alongside the system prompt, tools, and conversation history?
- "Lost in the Middle" (TACL): even with perfect retrieval, performance degrades 13.9%-85% as input length increases
- "Context Length Alone Hurts" (arXiv:2510.05381): degradation is worse for smaller models

### Token budget breakdown (estimated)

- System prompt + tools: ~500 tokens
- Question + table hint: ~50 tokens
- Per describe response: ~50 tokens
- Per query response: ~50-200 tokens
- Per few-shot example: ~300-500 tokens
- **Total available**: model context minus the overhead above

### Research needed for F011

1. Measure the Qwen3-0.6B/1.7B effective context window (when does performance degrade?)
2. How many few-shot examples fit before hitting context limits?
3. Does the model attend to examples in the middle of the context?
4. What is the minimum ICL example count for reliable tool-calling?

## Training Results Analysis (2026-04-01)

### Qwen3-1.7B, 1 epoch GRPO, 100-example per-turn SFT warmup

**What works:**

- Model learned proper multi-turn tool-calling (describe → query → answer)
- Generates real SQL with JOINs, GROUP BY, ORDER BY, subqueries
- ~30-40% of episodes get correct answers (reward ~1.15)
- GRPO produces a usable gradient signal (advantages range from -1.5 to +1.5)

**What doesn't work:**

- Model doesn't stop after answering — it keeps calling tools after "Episode is over", wasting the step budget
- SQL quality varies: column names are sometimes correct, sometimes wrong
- Answer format mismatches (correct data, wrong format)
- Training loss oscillates near zero (a plateau, not improvement)

**Bottleneck hypothesis:** The model can do tool-calling and basic SQL, but lacks the SQL reasoning to reliably get correct answers. ICL could help here by showing the reasoning pattern, not just the format.
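The advantage range quoted above comes from GRPO's group-relative normalization: each rollout's reward is scored against the mean and standard deviation of the other rollouts for the same question. A minimal sketch of that standard computation (function and values are illustrative, not taken from our training code):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward
    against the mean and std of its rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 4 rollouts for one question: one correct
# episode (reward ~1.15, as in our runs), three failures.
advs = grpo_advantages([1.15, 0.0, 0.0, 0.1])
# The correct rollout gets a positive advantage, the failures negative;
# advantages in a group always sum to ~0.
```

This is also why a plateau shows up as near-zero loss: when all rollouts in a group earn similar rewards, the advantages collapse toward zero and there is little gradient left.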
## Papers Referenced

| Paper | Key finding | Relevance |
|-------|-------------|-----------|
| ToolRL (2504.13958) | GRPO +17% over base for tool-calling | Direct comparison |
| ToolExpander (2510.07737) | Few-shot guided GRPO for 1.5B | Stabilizes small-model training |
| ICRL (2603.08068) | ICL + RL curriculum, no SFT needed | Hybrid approach |
| RC-GRPO (2602.03025) | SFT creates a peaked policy | Explains plateau |
| PEARL (2601.20439) | Plan + explore + RL for multi-hop | Multi-step tool use |
| Bespoke Labs blog | GRPO on Qwen-2.5-7B multi-turn | Industry baseline |
| Lost in the Middle (TACL) | U-shaped context performance | Context window limits |
| Context Length Alone (2510.05381) | Length degrades small models more | ICL overhead |
| Few-Shot PEFT > ICL (NeurIPS 2022) | FT beats ICL for small models | Baseline comparison |
| STAR pipeline (2603.21972) | Smaller models need curriculum | Training design |
| Distil Labs SLM blog | RL helps generative tasks, not structured ones | Task-dependent |

## Recommendations for F011 (Prompting Baseline Notebook)

### Techniques to test

1. **Zero-shot** — just tools + question, no examples
2. **1-shot** — one complete trajectory example
3. **3-shot** — three diverse examples (different DBs/query patterns)
4. **Chain-of-thought** — add reasoning before tool calls
5. **Context window test** — measure degradation with increasing examples

### Expected results (based on literature)

- Zero-shot on 1.7B: ~5-15% accuracy (model knows the tool format from pre-training)
- Few-shot on 1.7B: ~15-25% accuracy (helps with SQL patterns)
- GRPO-trained: ~30-40% accuracy (current results)
- The gap demonstrates the RL value proposition

### Metrics to report

- Accuracy per technique
- Average steps used
- Token budget consumed (prompt length)
- SQL quality (valid SQL rate, correct table/column references)
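The metrics above could be aggregated from per-episode logs with a small helper. A sketch under assumed field names (`correct`, `steps`, `prompt_tokens`, `queries`, `valid_queries` are hypothetical, not an existing SQLEnv schema):

```python
def summarize(episodes):
    """Aggregate baseline metrics over a list of episode logs.

    Assumed episode fields (hypothetical schema):
      correct (bool), steps (int), prompt_tokens (int),
      queries (int), valid_queries (int).
    """
    n = len(episodes)
    total_sql = sum(e["queries"] for e in episodes)
    valid_sql = sum(e["valid_queries"] for e in episodes)
    return {
        "accuracy": sum(e["correct"] for e in episodes) / n,
        "avg_steps": sum(e["steps"] for e in episodes) / n,
        "avg_prompt_tokens": sum(e["prompt_tokens"] for e in episodes) / n,
        "valid_sql_rate": valid_sql / total_sql if total_sql else 0.0,
    }

# Usage with two made-up episodes:
episodes = [
    {"correct": True, "steps": 3, "prompt_tokens": 800,
     "queries": 2, "valid_queries": 2},
    {"correct": False, "steps": 6, "prompt_tokens": 1400,
     "queries": 4, "valid_queries": 3},
]
summary = summarize(episodes)  # accuracy 0.5, avg_steps 4.5
```

Computing the same dict for each technique (zero-shot, 1-shot, 3-shot, CoT, GRPO-trained) would make the comparison table for the notebook directly.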