---
title: RL vs In-Context Learning for Small Model SQL Agents
description: >-
  Research synthesis on when GRPO training adds value over pure prompting with
  in-context learning for sub-2B parameter models
doc_type: exploration
---
# RL vs In-Context Learning for Small Model SQL Agents
Exploration doc for F011 (Prompting Baseline Notebook).
## Context
We train Qwen3-0.6B/1.7B to explore SQL databases using multi-turn tool calls (describe, query, answer). This doc synthesizes research on when GRPO training adds value over pure prompting with in-context learning.
## Key Finding: RL wins for small models on multi-turn tool use
For sub-2B models on multi-step tasks, the evidence strongly favors RL over pure ICL. But ICL can be a strong baseline, and a hybrid (ICL during RL rollouts) may be optimal.
## When RL (GRPO) beats ICL
### 1. Small models are weak in-context learners
- Fine-tuning advantage over ICL grows as model size shrinks (NeurIPS 2022: "Few-Shot PEFT is Better and Cheaper than ICL")
- Sub-2B models lack pre-training breadth for reliable few-shot extraction
- Context window is precious: few-shot examples consume tokens needed for schema descriptions and conversation history
### 2. Multi-turn exploration needs adaptive behavior
- RL teaches error recovery, retry strategies, exploration planning
- Static few-shot examples can't teach "if your query fails, try a different approach"
- ToolRL: GRPO gives +17% over base, +15% over SFT on BFCL V3
### 3. Per-query economics favor RL at deployment
| Factor | ICL | RL-trained |
|---|---|---|
| Training cost | Zero | One-time (~2h on L4) |
| Per-query cost | High (long prompts) | Low (internalized) |
| Latency | Higher | Lower |
## When ICL is sufficient
- Large models (7B+) on simple/single-turn tool calls
- Prototyping before committing to training infrastructure
- Constrained output space (few tools, fixed schemas)
- When the task is already within the model's pre-training distribution
## The Hybrid Approach (most promising for our case)
### ToolExpander (arXiv:2510.07737)
- Pure GRPO on 1.5B models is unstable and often collapses mid-training
- Fix: few-shot guided rollouts during RL, dynamically substituting hard samples with few-shot demonstrations
- Eliminated training collapse, reduced hard samples by 15-20%
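Our reading of the substitution mechanism, as a sketch only: the sample/demo structures, the per-prompt mean-reward bookkeeping, and the 0.1 threshold are all assumptions, not details from the paper.

```python
def guided_rollout_batch(samples, mean_reward, demos, threshold=0.1):
    """Prepend a few-shot demonstration to prompts the policy is currently
    failing on (low mean reward over recent rollouts); leave the rest alone."""
    batch = []
    for s in samples:
        if mean_reward.get(s["id"], 1.0) < threshold:
            # hard sample: guide the rollout with a demonstration
            batch.append({**s, "prompt": demos[0] + "\n\n" + s["prompt"]})
        else:
            batch.append(s)
    return batch
```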
### ICRL (arXiv:2603.08068)
- Use few-shot prompts during RL rollouts but progressively reduce them via curriculum learning
- Transition from few-shot to zero-shot over training
- Eliminates need for SFT entirely
- Achieved SOTA on QA and math reasoning
### Implication for SQLEnv
- Start GRPO with 1-2 few-shot examples in the prompt
- As training progresses, remove examples (curriculum)
- The model internalizes the ICL patterns via RL reward signal
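A minimal sketch of such a curriculum schedule. The linear decay over the first half of training is an assumption; ICRL-style curricula can be more elaborate.

```python
import math

def fewshot_count(step: int, total_steps: int, start: int = 2) -> int:
    """Number of few-shot examples to include at a given training step:
    linear decay from `start` down to zero over the first half of training,
    then pure zero-shot for the remainder."""
    decay_steps = max(1, total_steps // 2)
    if step >= decay_steps:
        return 0
    return math.ceil(start * (1 - step / decay_steps))
```

The rollout code would call this each step to decide how many example trajectories to splice into the prompt.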
## Context Window Considerations
### Qwen3-0.6B/1.7B context limits
- TODO: Measure effective context window and performance degradation
- Need to determine: how many few-shot examples fit alongside the system prompt, tools, and conversation history?
- "Lost in the Middle" (TACL): even with perfect retrieval, performance degrades 13.9%-85% as input length increases
- "Context Length Alone Hurts" (arXiv:2510.05381): degradation is worse for smaller models
### Token budget breakdown (estimated)
- System prompt + tools: ~500 tokens
- Question + table hint: ~50 tokens
- Per describe response: ~50 tokens
- Per query response: ~50-200 tokens
- Per few-shot example: ~300-500 tokens
- Total available: model context length minus the overhead above
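Plugging the estimates above into a quick calculator. The 512-token generation reserve and the worst-case 200 tokens per tool response are assumptions layered on top of the doc's estimates.

```python
def examples_that_fit(context_len: int, n_turns: int, fewshot_tokens: int = 400) -> int:
    """How many few-shot examples fit in the prompt, given the token
    estimates above and a reserve for the model's own generation."""
    SYSTEM_AND_TOOLS = 500    # system prompt + tool schemas
    QUESTION = 50             # question + table hint
    PER_TURN = 200            # worst-case tool response per turn
    GENERATION_RESERVE = 512  # room for the model's output
    overhead = SYSTEM_AND_TOOLS + QUESTION + n_turns * PER_TURN + GENERATION_RESERVE
    return max(0, (context_len - overhead) // fewshot_tokens)
```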
### Research needed for F011
- Measure Qwen3-0.6B/1.7B effective context window (when does performance degrade?)
- How many few-shot examples fit before hitting context limits?
- Does the model attend to examples in the middle of the context?
- What's the minimum ICL example count for reliable tool-calling?
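One way to script the first measurement, as a sketch: `evaluate` is a hypothetical callback that runs the eval set with a given prefix prepended to every prompt and returns accuracy.

```python
def degradation_curve(evaluate, filler: str = " the", lengths=(0, 512, 1024, 2048, 4096)):
    """Accuracy as a function of irrelevant prefix length (in filler words).
    A sharp drop in the returned dict marks the effective context window."""
    return {n: evaluate(filler * n) for n in lengths}
```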
## Training Results Analysis (2026-04-01)
Qwen3-1.7B, 1 epoch GRPO, 100-example per-turn SFT warmup
What works:
- Model learned proper multi-turn tool-calling (describe → query → answer)
- Generates real SQL with JOINs, GROUP BY, ORDER BY, subqueries
- ~30-40% of episodes get correct answers (reward ~1.15)
- GRPO produces gradient signal (advantages range -1.5 to +1.5)
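For reference, the advantage range quoted above follows from GRPO's group-relative normalization, sketched here in its standard z-score form (the exact variant in our trainer may differ):

```python
import statistics

def grpo_advantages(rewards: list) -> list:
    """Z-score each rollout's reward against its group (same prompt), so
    above-average rollouts get positive advantage and vice versa."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]
```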
What doesn't work:
- Model doesn't stop after answering: it keeps calling tools after "Episode is over" (wasting step budget)
- SQL quality varies: sometimes uses correct column names, sometimes wrong ones
- Answer format mismatches (correct data, wrong format)
- Training loss oscillates near zero (plateau, not improvement)
Bottleneck hypothesis: The model can do tool-calling and basic SQL, but lacks the SQL reasoning to reliably get correct answers. ICL could help here by showing the reasoning pattern, not just the format.
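The no-stop failure also suggests an env-side guard worth testing: hard-terminate the episode on the first `answer` call. A sketch, where `policy` and `env` are stand-ins for the actual agent and SQLEnv:

```python
def run_episode(policy, env, max_steps: int = 10):
    """Roll out one episode, forcing termination after the first `answer`
    tool call so later calls cannot waste the step budget."""
    history = []
    for _ in range(max_steps):
        call = policy(history)             # {"name": ..., "args": ...}
        if call["name"] == "answer":
            return call["args"], history   # hard stop on answer
        history.append((call, env(call)))  # describe/query go to the env
    return None, history                   # step budget exhausted
```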
## Papers Referenced
| Paper | Key finding | Relevance |
|---|---|---|
| ToolRL (2504.13958) | GRPO +17% over base for tool-calling | Direct comparison |
| ToolExpander (2510.07737) | Few-shot guided GRPO for 1.5B | Stabilizes small model training |
| ICRL (2603.08068) | ICL + RL curriculum, no SFT needed | Hybrid approach |
| RC-GRPO (2602.03025) | SFT creates peaked policy | Explains plateau |
| PEARL (2601.20439) | Plan + explore + RL for multi-hop | Multi-step tool use |
| Bespoke Labs blog | GRPO on Qwen-2.5-7B multi-turn | Industry baseline |
| Lost in the Middle (TACL) | U-shaped context performance | Context window limits |
| Context Length Alone (2510.05381) | Length degrades small models more | ICL overhead |
| Few-Shot PEFT > ICL (NeurIPS 2022) | FT beats ICL for small models | Baseline comparison |
| STAR pipeline (2603.21972) | Smaller models need curriculum | Training design |
| Distil Labs SLM blog | RL helps generative, not structured | Task-dependent |
## Recommendations for F011 (Prompting Baseline Notebook)
### Techniques to test
- Zero-shot: just tools + question, no examples
- 1-shot: one complete trajectory example
- 3-shot: three diverse examples (different DBs/query patterns)
- Chain-of-thought: add reasoning before tool calls
- Context window test: measure degradation with increasing examples
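The zero-/few-shot variants differ only in how many example trajectories are spliced into the chat. A sketch of the assembly; the message-dict format is the usual chat-template convention, assumed here rather than taken from our codebase:

```python
def build_prompt(question: str, example_trajectories: list, system: str) -> list:
    """Assemble chat messages: system prompt, then zero or more complete
    example trajectories (user/assistant/tool messages), then the real question."""
    messages = [{"role": "system", "content": system}]
    for trajectory in example_trajectories:
        messages.extend(trajectory)
    messages.append({"role": "user", "content": question})
    return messages
```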
### Expected results (based on literature)
- Zero-shot on 1.7B: ~5-15% accuracy (model knows tool format from pre-training)
- Few-shot on 1.7B: ~15-25% accuracy (helps with SQL patterns)
- GRPO-trained: ~30-40% accuracy (current results)
- Gap demonstrates RL value proposition
### Metrics to report
- Accuracy per technique
- Average steps used
- Token budget consumed (prompt length)
- SQL quality (valid SQL rate, correct table/column references)
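The valid-SQL-rate metric can be computed without executing queries by asking SQLite to plan them. A sketch; passing the episode's actual database file as `db_path` (an assumption about how we'd wire it up) also catches unknown table/column references:

```python
import sqlite3

def is_valid_sql(query: str, db_path: str = ":memory:") -> bool:
    """True if SQLite can compile the query: catches syntax errors and,
    against a real database file, bad table/column references."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f"EXPLAIN {query}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()
```

Valid-SQL rate is then the mean of `is_valid_sql` over all generated queries.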