---
title: RL vs In-Context Learning for Small Model SQL Agents
description: >-
  Research synthesis on when GRPO training adds value over pure prompting with
  in-context learning for sub-2B parameter models
doc_type: exploration
---

# RL vs In-Context Learning for Small Model SQL Agents

Exploration doc for F011 (Prompting Baseline Notebook).

## Context

We train Qwen3-0.6B/1.7B to explore SQL databases using multi-turn tool calls (describe, query, answer). This doc synthesizes research on when GRPO training adds value over pure prompting with in-context learning.

## Key Finding: RL wins for small models on multi-turn tool use

For sub-2B models on multi-step tasks, the evidence strongly favors RL over pure ICL. But ICL can be a strong baseline, and a hybrid (ICL during RL rollouts) may be optimal.

### When RL (GRPO) beats ICL

#### 1. Small models are weak in-context learners

- The fine-tuning advantage over ICL grows as model size shrinks (NeurIPS 2022: "Few-Shot PEFT is Better and Cheaper than ICL")
- Sub-2B models lack the pre-training breadth for reliable few-shot extraction
- Context window is precious: few-shot examples consume tokens needed for schema descriptions and conversation history

#### 2. Multi-turn exploration needs adaptive behavior

- RL teaches error recovery, retry strategies, and exploration planning
- Static few-shot examples can't teach "if your query fails, try a different approach"
- ToolRL: GRPO gives +17% over the base model and +15% over SFT on BFCL V3

#### 3. Per-query economics favor RL at deployment

| Factor | ICL | RL-trained |
|---|---|---|
| Training cost | Zero | One-time (~2 h on an L4 GPU) |
| Per-query cost | High (long prompts) | Low (behavior internalized) |
| Latency | Higher | Lower |

### When ICL is sufficient

- Large models (7B+) on simple, single-turn tool calls
- Prototyping before committing to training infrastructure
- Constrained output space (few tools, fixed schemas)
- When the task is already within the model's pre-training distribution

## The Hybrid Approach (most promising for our case)

### ToolExpander (arXiv:2510.07737)

- Pure GRPO on 1.5B models is unstable and often collapses mid-training
- Fix: few-shot guided rollouts during RL, dynamically substituting hard samples with few-shot demonstrations
- Eliminated training collapse and reduced hard samples by 15-20%

### ICRL (arXiv:2603.08068)

- Uses few-shot prompts during RL rollouts but progressively reduces them via curriculum learning
- Transitions from few-shot to zero-shot over the course of training
- Eliminates the need for SFT entirely
- Achieved SOTA on QA and math-reasoning benchmarks

### Implication for SQLEnv

- Start GRPO with 1-2 few-shot examples in the prompt
- As training progresses, remove examples (curriculum)
- The model internalizes the ICL patterns via the RL reward signal

## Context Window Considerations

### Qwen3-0.6B/1.7B context limits

- TODO: Measure the effective context window and performance degradation
- Need to determine: how many few-shot examples fit alongside the system prompt, tool definitions, and conversation history?
- "Lost in the Middle" (TACL): even with perfect retrieval, performance degrades by 13.9%-85% as input length increases
- "Context Length Alone Hurts" (arXiv:2510.05381): degradation is worse for smaller models

### Token budget breakdown (estimated)

- System prompt + tools: ~500 tokens
- Question + table hint: ~50 tokens
- Per describe response: ~50 tokens
- Per query response: ~50-200 tokens
- Per few-shot example: ~300-500 tokens
- Total available: model context minus the overhead above

### Research needed for F011

1. Measure the Qwen3-0.6B/1.7B effective context window (at what length does performance degrade?)
2. How many few-shot examples fit before hitting context limits?
3. Does the model attend to examples in the middle of the context?
4. What is the minimum ICL example count for reliable tool calling?
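Questions 1 and 2 can share one harness: sweep the example count and record accuracy at each setting. `run_episode` is an assumed callable that builds the prompt, runs the multi-turn episode, and returns whether the final answer was correct:

```python
def context_sweep(run_episode, eval_questions, example_pool, max_examples=5):
    """Accuracy as a function of few-shot example count.

    run_episode(question, examples) -> bool is assumed to run one
    full episode and score the final answer.
    """
    accuracy = {}
    for n in range(max_examples + 1):
        shots = example_pool[:n]
        correct = sum(run_episode(q, shots) for q in eval_questions)
        accuracy[n] = correct / len(eval_questions)
    return accuracy
```

A drop in accuracy at higher `n` (rather than a plateau) would be the "Lost in the Middle"-style degradation the papers above predict.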

## Training Results Analysis (2026-04-01)

**Setup:** Qwen3-1.7B, 1 epoch of GRPO, 100-example per-turn SFT warmup

**What works:**

- The model learned proper multi-turn tool calling (describe → query → answer)
- It generates real SQL with JOINs, GROUP BY, ORDER BY, and subqueries
- ~30-40% of episodes reach correct answers (reward ~1.15)
- GRPO produces a usable gradient signal (advantages range from -1.5 to +1.5)

**What doesn't work:**

- The model doesn't stop after answering: it keeps calling tools after "Episode is over", wasting step budget
- SQL quality varies: column names are sometimes correct, sometimes wrong
- Answer format mismatches (correct data, wrong format)
- Training loss oscillates near zero (a plateau, not improvement)

**Bottleneck hypothesis:** The model can do tool-calling and basic SQL, but lacks the SQL reasoning to reliably get correct answers. ICL could help here by showing the reasoning pattern, not just the format.

## Papers Referenced

| Paper | Key finding | Relevance |
|---|---|---|
| ToolRL (2504.13958) | GRPO +17% over base for tool-calling | Direct comparison |
| ToolExpander (2510.07737) | Few-shot guided GRPO for 1.5B | Stabilizes small-model training |
| ICRL (2603.08068) | ICL + RL curriculum, no SFT needed | Hybrid approach |
| RC-GRPO (2602.03025) | SFT creates a peaked policy | Explains plateau |
| PEARL (2601.20439) | Plan + explore + RL for multi-hop | Multi-step tool use |
| Bespoke Labs blog | GRPO on Qwen-2.5-7B multi-turn | Industry baseline |
| Lost in the Middle (TACL) | U-shaped context performance | Context window limits |
| Context Length Alone (2510.05381) | Length degrades small models more | ICL overhead |
| Few-Shot PEFT > ICL (NeurIPS 2022) | FT beats ICL for small models | Baseline comparison |
| STAR pipeline (2603.21972) | Smaller models need curriculum | Training design |
| Distil Labs SLM blog | RL helps generative, not structured tasks | Task-dependent |

## Recommendations for F011 (Prompting Baseline Notebook)

### Techniques to test

1. Zero-shot: just tools + question, no examples
2. 1-shot: one complete trajectory example
3. 3-shot: three diverse examples (different DBs/query patterns)
4. Chain-of-thought: add reasoning before tool calls
5. Context window test: measure degradation with an increasing number of examples
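A minimal prompt builder covering variants 1-4; the layout is illustrative and would be replaced by the model's actual chat template in the notebook:

```python
def build_prompt(system: str, tools: str, question: str,
                 examples: list[str], cot: bool = False) -> str:
    """Assemble an n-shot prompt. Each entry in `examples` is a full
    serialized trajectory (question, tool calls, final answer)."""
    parts = [system, tools]
    parts += [f"Example:\n{ex}" for ex in examples]
    if cot:
        parts.append("Think step by step before each tool call.")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```

Keeping one builder for all variants makes the sweep in technique 5 a one-line change: pass 0, 1, or 3 trajectories and measure accuracy against prompt length.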

### Expected results (based on literature)

- Zero-shot on 1.7B: ~5-15% accuracy (the model knows the tool format from pre-training)
- Few-shot on 1.7B: ~15-25% accuracy (helps with SQL patterns)
- GRPO-trained: ~30-40% accuracy (current results)
- The gap demonstrates the RL value proposition

### Metrics to report

- Accuracy per technique
- Average steps used
- Token budget consumed (prompt length)
- SQL quality (valid-SQL rate, correct table/column references)