---
title: RL vs In-Context Learning for Small Model SQL Agents
description: >-
  Research synthesis on when GRPO training adds value over pure prompting with
  in-context learning for sub-2B parameter models
doc_type: exploration
---

# RL vs In-Context Learning for Small Model SQL Agents

Exploration doc for F011 (Prompting Baseline Notebook).

## Context

We train Qwen3-0.6B/1.7B to explore SQL databases using multi-turn tool calls (describe, query, answer). This doc synthesizes research on when GRPO training adds value over pure prompting with in-context learning.

## Key Finding: RL wins for small models on multi-turn tool use

For sub-2B models on multi-step tasks, the evidence strongly favors RL over pure ICL. But ICL can be a strong baseline, and a hybrid (ICL during RL rollouts) may be optimal.

### When RL (GRPO) beats ICL

#### 1. Small models are weak in-context learners

- The fine-tuning advantage over ICL grows as model size shrinks (NeurIPS 2022: "Few-Shot PEFT is Better and Cheaper than ICL")
- Sub-2B models lack the pre-training breadth for reliable few-shot extraction
- Context window is precious: few-shot examples consume tokens needed for schema descriptions and conversation history

#### 2. Multi-turn exploration needs adaptive behavior

- RL teaches error recovery, retry strategies, and exploration planning
- Static few-shot examples can't teach "if your query fails, try a different approach"
- ToolRL: GRPO gives +17% over the base model and +15% over SFT on BFCL V3

#### 3. Per-query economics favor RL at deployment

| Factor | ICL | RL-trained |
|---|---|---|
| Training cost | Zero | One-time (~2 h on an L4 GPU) |
| Per-query cost | High (long prompts) | Low (behavior internalized) |
| Latency | Higher | Lower |

### When ICL is sufficient

- Large models (7B+) on simple, single-turn tool calls
- Prototyping before committing to training infrastructure
- Constrained output space (few tools, fixed schemas)
- When the task is already within the model's pre-training distribution

## The Hybrid Approach (most promising for our case)

### ToolExpander (arXiv:2510.07737)

- Pure GRPO on 1.5B models is unstable and often collapses mid-training
- Fix: few-shot guided rollouts during RL, dynamically substituting hard samples with few-shot demonstrations
- Eliminated training collapse and reduced hard samples by 15-20%

### ICRL (arXiv:2603.08068)

- Uses few-shot prompts during RL rollouts but progressively reduces them via curriculum learning
- Transitions from few-shot to zero-shot over the course of training
- Eliminates the need for SFT entirely
- Achieved SOTA on QA and math-reasoning benchmarks

### Implication for SQLEnv

- Start GRPO with 1-2 few-shot examples in the prompt
- As training progresses, remove examples (curriculum)
- The model internalizes the ICL patterns via the RL reward signal

## Context Window Considerations

### Qwen3-0.6B/1.7B context limits

- TODO: Measure the effective context window and performance degradation
- Need to determine: how many few-shot examples fit alongside the system prompt, tool definitions, and conversation history?
- "Lost in the Middle" (TACL): even with perfect retrieval, performance degrades by 13.9%-85% as input length increases
- "Context Length Alone Hurts" (arXiv:2510.05381): degradation is worse for smaller models

### Token budget breakdown (estimated)

- System prompt + tools: ~500 tokens
- Question + table hint: ~50 tokens
- Per describe response: ~50 tokens
- Per query response: ~50-200 tokens
- Per few-shot example: ~300-500 tokens
- Total available: model context minus the overhead above

### Research needed for F011

1. Measure the Qwen3-0.6B/1.7B effective context window (at what length does performance degrade?)
2. How many few-shot examples fit before hitting context limits?
3. Does the model attend to examples in the middle of the context?
4. What is the minimum ICL example count for reliable tool calling?
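Questions 1 and 2 can share one harness: sweep the example count and record accuracy at each setting. `run_episode` is an assumed callable that builds the prompt, runs the multi-turn episode, and returns whether the final answer was correct:

```python
def context_sweep(run_episode, eval_questions, example_pool, max_examples=5):
    """Accuracy as a function of few-shot example count.

    run_episode(question, examples) -> bool is assumed to run one
    full episode and score the final answer.
    """
    accuracy = {}
    for n in range(max_examples + 1):
        shots = example_pool[:n]
        correct = sum(run_episode(q, shots) for q in eval_questions)
        accuracy[n] = correct / len(eval_questions)
    return accuracy
```

A drop in accuracy at higher `n` (rather than a plateau) would be the "Lost in the Middle"-style degradation the papers above predict.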

## Training Results Analysis (2026-04-01)

**Setup:** Qwen3-1.7B, 1 epoch of GRPO, 100-example per-turn SFT warmup

**What works:**

- The model learned proper multi-turn tool calling (describe → query → answer)
- It generates real SQL with JOINs, GROUP BY, ORDER BY, and subqueries
- ~30-40% of episodes reach correct answers (reward ~1.15)
- GRPO produces a usable gradient signal (advantages range from -1.5 to +1.5)

**What doesn't work:**

- The model doesn't stop after answering: it keeps calling tools after "Episode is over", wasting step budget
- SQL quality varies: column names are sometimes correct, sometimes wrong
- Answer format mismatches (correct data, wrong format)
- Training loss oscillates near zero (a plateau, not improvement)

**Bottleneck hypothesis:** The model can do tool-calling and basic SQL, but lacks the SQL reasoning to reliably get correct answers. ICL could help here by showing the reasoning pattern, not just the format.

## Papers Referenced

| Paper | Key finding | Relevance |
|---|---|---|
| ToolRL (2504.13958) | GRPO +17% over base for tool-calling | Direct comparison |
| ToolExpander (2510.07737) | Few-shot guided GRPO for 1.5B | Stabilizes small-model training |
| ICRL (2603.08068) | ICL + RL curriculum, no SFT needed | Hybrid approach |
| RC-GRPO (2602.03025) | SFT creates a peaked policy | Explains plateau |
| PEARL (2601.20439) | Plan + explore + RL for multi-hop | Multi-step tool use |
| Bespoke Labs blog | GRPO on Qwen-2.5-7B multi-turn | Industry baseline |
| Lost in the Middle (TACL) | U-shaped context performance | Context window limits |
| Context Length Alone (2510.05381) | Length degrades small models more | ICL overhead |
| Few-Shot PEFT > ICL (NeurIPS 2022) | FT beats ICL for small models | Baseline comparison |
| STAR pipeline (2603.21972) | Smaller models need curriculum | Training design |
| Distil Labs SLM blog | RL helps generative, not structured tasks | Task-dependent |

## Recommendations for F011 (Prompting Baseline Notebook)

### Techniques to test

1. Zero-shot: just tools + question, no examples
2. 1-shot: one complete trajectory example
3. 3-shot: three diverse examples (different DBs/query patterns)
4. Chain-of-thought: add reasoning before tool calls
5. Context window test: measure degradation with an increasing number of examples
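A minimal prompt builder covering variants 1-4; the layout is illustrative and would be replaced by the model's actual chat template in the notebook:

```python
def build_prompt(system: str, tools: str, question: str,
                 examples: list[str], cot: bool = False) -> str:
    """Assemble an n-shot prompt. Each entry in `examples` is a full
    serialized trajectory (question, tool calls, final answer)."""
    parts = [system, tools]
    parts += [f"Example:\n{ex}" for ex in examples]
    if cot:
        parts.append("Think step by step before each tool call.")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```

Keeping one builder for all variants makes the sweep in technique 5 a one-line change: pass 0, 1, or 3 trajectories and measure accuracy against prompt length.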

### Expected results (based on literature)

- Zero-shot on 1.7B: ~5-15% accuracy (the model knows the tool format from pre-training)
- Few-shot on 1.7B: ~15-25% accuracy (helps with SQL patterns)
- GRPO-trained: ~30-40% accuracy (current results)
- The gap demonstrates the RL value proposition

### Metrics to report

- Accuracy per technique
- Average steps used
- Token budget consumed (prompt length)
- SQL quality (valid-SQL rate, correct table/column references)