---
title: RL vs In-Context Learning for Small Model SQL Agents
description: Research synthesis on when GRPO training adds value over pure prompting with in-context learning for sub-2B parameter models
doc_type: exploration
---

# RL vs In-Context Learning for Small Model SQL Agents

Exploration doc for F011 (Prompting Baseline Notebook).

## Context

We train Qwen3-0.6B/1.7B to explore SQL databases using multi-turn tool calls (describe, query, answer). This doc synthesizes research on when GRPO training adds value over pure prompting with in-context learning.

## Key Finding: RL wins for small models on multi-turn tool use

For sub-2B models on multi-step tasks, the evidence strongly favors RL over pure ICL. But ICL can be a strong baseline — and a hybrid (ICL during RL rollouts) may be optimal.

## When RL (GRPO) beats ICL

### 1. Small models are weak in-context learners

- Fine-tuning advantage over ICL **grows** as model size **shrinks** (NeurIPS 2022: "Few-Shot PEFT is Better and Cheaper than ICL")
- Sub-2B models lack pre-training breadth for reliable few-shot extraction
- Context window is precious — few-shot examples consume tokens needed for schema descriptions and conversation history

### 2. Multi-turn exploration needs adaptive behavior

- RL teaches error recovery, retry strategies, exploration planning
- Static few-shot examples can't teach "if your query fails, try a different approach"
- ToolRL: GRPO gives +17% over base, +15% over SFT on BFCL V3

### 3. Per-query economics favor RL at deployment

| Factor         | ICL                 | RL-trained           |
|----------------|---------------------|----------------------|
| Training cost  | Zero                | One-time (~2h on L4) |
| Per-query cost | High (long prompts) | Low (internalized)   |
| Latency        | Higher              | Lower                |

## When ICL is sufficient

- Large models (7B+) on simple/single-turn tool calls
- Prototyping before committing to training infrastructure
- Constrained output space (few tools, fixed schemas)
- When the task is already within the model's pre-training distribution

## The Hybrid Approach (most promising for our case)

### ToolExpander (arXiv:2510.07737)

- Pure GRPO on 1.5B models is unstable and often collapses mid-training
- Fix: **few-shot guided rollouts during RL** — dynamically substitute hard samples with few-shot demonstrations
- Eliminated training collapse, reduced hard samples by 15-20%

### ICRL (arXiv:2603.08068)

- Use few-shot prompts **during RL rollouts** but progressively reduce them via curriculum learning
- Transition from few-shot to zero-shot over training
- Eliminates need for SFT entirely
- Achieved SOTA on QA and math reasoning

### Implication for SQLEnv

- Start GRPO with 1-2 few-shot examples in the prompt
- As training progresses, remove examples (curriculum)
- The model internalizes the ICL patterns via the RL reward signal

## Context Window Considerations

### Qwen3-0.6B/1.7B context limits

- **TODO**: Measure effective context window and performance degradation
- Need to determine: how many few-shot examples fit alongside the system prompt, tools, and conversation history?
- "Lost in the Middle" (TACL): even with perfect retrieval, performance degrades 13.9%-85% as input length increases
- "Context Length Alone Hurts" (arXiv:2510.05381): degradation is worse for smaller models

### Token budget breakdown (estimated)

- System prompt + tools: ~500 tokens
- Question + table hint: ~50 tokens
- Per describe response: ~50 tokens
- Per query response: ~50-200 tokens
- Per few-shot example: ~300-500 tokens
- **Total available**: model context minus the overhead above

### Research needed for F011

1. Measure the Qwen3-0.6B/1.7B effective context window (when does performance degrade?)
2. How many few-shot examples fit before hitting context limits?
3. Does the model attend to examples in the middle of the context?
4. What is the minimum ICL example count for reliable tool-calling?

## Training Results Analysis (2026-04-01)

### Qwen3-1.7B, 1 epoch GRPO, 100-example per-turn SFT warmup

**What works:**

- Model learned proper multi-turn tool-calling (describe → query → answer)
- Generates real SQL with JOINs, GROUP BY, ORDER BY, subqueries
- ~30-40% of episodes get correct answers (reward ~1.15)
- GRPO produces a usable gradient signal (advantages range from -1.5 to +1.5)

**What doesn't work:**

- Model doesn't stop after answering — it keeps calling tools after "Episode is over", wasting the step budget
- SQL quality varies: column names are sometimes correct, sometimes wrong
- Answer format mismatches (correct data, wrong format)
- Training loss oscillates near zero (a plateau, not improvement)

**Bottleneck hypothesis:** The model can do tool-calling and basic SQL, but lacks the SQL reasoning to reliably get correct answers. ICL could help here by showing the reasoning pattern, not just the format.
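The advantage range quoted above comes from GRPO's group-relative normalization: each rollout's reward is scored against the mean and standard deviation of the other rollouts for the same question. A minimal sketch of that standard computation (function and values are illustrative, not taken from our training code):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each rollout's reward
    against the mean and std of its rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 4 rollouts for one question: one correct
# episode (reward ~1.15, as in our runs), three failures.
advs = grpo_advantages([1.15, 0.0, 0.0, 0.1])
# The correct rollout gets a positive advantage, the failures negative;
# advantages in a group always sum to ~0.
```

This is also why a plateau shows up as near-zero loss: when all rollouts in a group earn similar rewards, the advantages collapse toward zero and there is little gradient left.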
## Papers Referenced

| Paper | Key finding | Relevance |
|-------|-------------|-----------|
| ToolRL (2504.13958) | GRPO +17% over base for tool-calling | Direct comparison |
| ToolExpander (2510.07737) | Few-shot guided GRPO for 1.5B | Stabilizes small-model training |
| ICRL (2603.08068) | ICL + RL curriculum, no SFT needed | Hybrid approach |
| RC-GRPO (2602.03025) | SFT creates a peaked policy | Explains plateau |
| PEARL (2601.20439) | Plan + explore + RL for multi-hop | Multi-step tool use |
| Bespoke Labs blog | GRPO on Qwen-2.5-7B multi-turn | Industry baseline |
| Lost in the Middle (TACL) | U-shaped context performance | Context window limits |
| Context Length Alone (2510.05381) | Length degrades small models more | ICL overhead |
| Few-Shot PEFT > ICL (NeurIPS 2022) | FT beats ICL for small models | Baseline comparison |
| STAR pipeline (2603.21972) | Smaller models need curriculum | Training design |
| Distil Labs SLM blog | RL helps generative tasks, not structured ones | Task-dependent |

## Recommendations for F011 (Prompting Baseline Notebook)

### Techniques to test

1. **Zero-shot** — just tools + question, no examples
2. **1-shot** — one complete trajectory example
3. **3-shot** — three diverse examples (different DBs/query patterns)
4. **Chain-of-thought** — add reasoning before tool calls
5. **Context window test** — measure degradation with increasing examples

### Expected results (based on literature)

- Zero-shot on 1.7B: ~5-15% accuracy (model knows the tool format from pre-training)
- Few-shot on 1.7B: ~15-25% accuracy (helps with SQL patterns)
- GRPO-trained: ~30-40% accuracy (current results)
- The gap demonstrates the RL value proposition

### Metrics to report

- Accuracy per technique
- Average steps used
- Token budget consumed (prompt length)
- SQL quality (valid SQL rate, correct table/column references)
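The metrics above could be aggregated from per-episode logs with a small helper. A sketch under assumed field names (`correct`, `steps`, `prompt_tokens`, `queries`, `valid_queries` are hypothetical, not an existing SQLEnv schema):

```python
def summarize(episodes):
    """Aggregate baseline metrics over a list of episode logs.

    Assumed episode fields (hypothetical schema):
      correct (bool), steps (int), prompt_tokens (int),
      queries (int), valid_queries (int).
    """
    n = len(episodes)
    total_sql = sum(e["queries"] for e in episodes)
    valid_sql = sum(e["valid_queries"] for e in episodes)
    return {
        "accuracy": sum(e["correct"] for e in episodes) / n,
        "avg_steps": sum(e["steps"] for e in episodes) / n,
        "avg_prompt_tokens": sum(e["prompt_tokens"] for e in episodes) / n,
        "valid_sql_rate": valid_sql / total_sql if total_sql else 0.0,
    }

# Usage with two made-up episodes:
episodes = [
    {"correct": True, "steps": 3, "prompt_tokens": 800,
     "queries": 2, "valid_queries": 2},
    {"correct": False, "steps": 6, "prompt_tokens": 1400,
     "queries": 4, "valid_queries": 3},
]
summary = summarize(episodes)  # accuracy 0.5, avg_steps 4.5
```

Computing the same dict for each technique (zero-shot, 1-shot, 3-shot, CoT, GRPO-trained) would make the comparison table for the notebook directly.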