
SQLEnv: Teaching Small Models to Explore Databases Like Analysts

What Analysts Do That One-Shot LLMs Miss

A data analyst opens a new database. She doesn't write the final query first. She runs DESCRIBE to see what columns exist, SELECT * LIMIT 5 to understand the data, then builds her query piece by piece — adjusting joins, fixing column names, retrying after errors. The answer emerges from a process, not a guess.

Most text-to-SQL systems skip this process entirely. They take a question and a schema, produce one SQL query, and hope it's right. When it isn't — wrong column name, missing join, bad filter — there's no recovery. The model guessed once and moved on.

SQLEnv is a reinforcement learning environment that trains the exploration habit instead.

The Problem: Static Benchmarks Reward Memorization

Spider and BIRD measure whether a model can produce a correct SQL query in one shot. High accuracy on these benchmarks doesn't mean the model can handle an unfamiliar schema. It means the model memorized patterns from training data that happen to match the test set.

The gap shows up immediately in production. Schemas change. Column names don't match expectations. Tables have unexpected relationships. A model that can't recover from its first wrong guess is brittle in exactly the situations where SQL competence matters most.

The deeper problem: these benchmarks grade the final answer and ignore the process. An agent that explores the schema, discovers a mistake, and corrects itself gets the same score as one that guesses right on the first try — and no credit at all when the guess fails. There's no learning signal for the investigative reasoning that makes human analysts reliable.

SQLEnv: An Interactive RL Environment

SQLEnv gives agents four actions that mirror what analysts do:

  • DESCRIBE — see column names and types for a table
  • SAMPLE — view example rows to understand the data
  • QUERY — run a read-only SQL query and see results
  • ANSWER — submit a final answer

Each episode starts with a natural-language question and a list of table names. The schema is hidden. The agent must discover columns, types, and relationships through exploration before it can answer. This partial observability is the point — it forces the agent to develop strategy rather than pattern-match.
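The four actions above can be sketched as a typed action space. This is a minimal illustration using plain dataclasses; the class and field names are stand-ins, not SQLEnv's actual API.

```python
# Minimal sketch of the four-action interface. Names are illustrative.
from dataclasses import dataclass
from typing import Union

@dataclass
class Describe:
    table: str          # return column names and types for this table

@dataclass
class Sample:
    table: str          # return a few example rows
    limit: int = 5

@dataclass
class Query:
    sql: str            # run a read-only SQL statement and return results

@dataclass
class Answer:
    value: str          # submit the final answer, ending the episode

Action = Union[Describe, Sample, Query, Answer]

def is_terminal(action: Action) -> bool:
    """Only ANSWER ends the episode; everything else is exploration."""
    return isinstance(action, Answer)
```

Typed actions make every interaction explicit: the agent emits one structured action per turn, and only ANSWER terminates the episode.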

The environment plugs into TRL's environment_factory for GRPO training, runs in Docker for safe SQL execution, and deploys to HuggingFace Spaces.

How an Episode Works

Consider the question: "What is the total bonus given in all evaluations?"

A trained agent's episode looks like this:

  1. describe("evaluation") → sees columns: Employee_ID, Year_awarded, Bonus
  2. query("SELECT SUM(Bonus) FROM evaluation") → result: 19500.0
  3. answer("19500.0") → correct

Three tool calls, clean execution. The agent discovered the schema, wrote the right query, and submitted. But the interesting behavior is what happens when things go wrong.

On a harder question — "Which employee received the biggest bonus?" — the agent tries SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1, gets no such column: Salary, calls describe("evaluation") to find the Bonus column, then writes the correct JOIN. The error recovery is learned behavior, not hard-coded.
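That recovery loop can be sketched against a toy in-memory database. The ToySQLEnv class and its describe/query/answer methods below are illustrative stand-ins for the real environment interface, not the project's actual code.

```python
# Toy stand-in for the environment: wrong query -> loud error -> re-describe -> retry.
import sqlite3

class ToySQLEnv:
    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE evaluation (Employee_ID INT, Year_awarded INT, Bonus REAL)")
        self.db.executemany("INSERT INTO evaluation VALUES (?, ?, ?)",
                            [(1, 2020, 9000.0), (2, 2021, 10500.0)])
        self.gold = "19500.0"

    def describe(self, table):
        cols = self.db.execute(f"PRAGMA table_info({table})").fetchall()
        return [(c[1], c[2]) for c in cols]           # (name, type) pairs

    def query(self, sql):
        try:
            return {"ok": True, "rows": self.db.execute(sql).fetchall()}
        except sqlite3.OperationalError as e:          # surface the error verbatim
            return {"ok": False, "error": str(e)}

    def answer(self, value):
        return value == self.gold

env = ToySQLEnv()
# Step 1: a wrong column name fails loudly instead of returning empty results.
bad = env.query("SELECT SUM(Salary) FROM evaluation")
# Step 2: recover by inspecting the schema, then retry with the real column.
cols = env.describe("evaluation")
good = env.query("SELECT SUM(Bonus) FROM evaluation")
correct = env.answer(str(good["rows"][0][0]))
```

The key design choice mirrored here: errors come back as observations, not crashes, so the agent can condition its next action on them.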

Reward Architecture

The environment provides three layers of reward signal:

Operational feedback on every step: did the SQL execute (+0.02) or error? Is this a new query (+0.01) or a repeat (-0.01)?

Progress signal on queries: how close are the results to the correct answer? Measured by value overlap and result cardinality, shaped as delta-from-previous-step. This form is potential-based (Ng et al., 1999), which provably preserves the optimal policy — the agent can't game progress rewards at the expense of correctness.

Terminal reward for the answer: +1.0 for correct, 0.0 for wrong.

Terminal correctness dominates by design. The maximum exploration reward across all steps is roughly 0.3; a correct answer is worth 1.0. An agent that explores endlessly but never answers will always score less than one that answers correctly. Dense intermediate rewards exist to make training feasible on small models — without them, a 0.6B parameter model can't learn from sparse terminal-only signal.
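The three layers above can be sketched in a few lines. The constants mirror the text (+0.02 execute, ±0.01 novelty, +1.0 terminal), but the progress potential here is a stand-in for the real value-overlap metric.

```python
# Sketch of the reward layers; `progress` stands in for the real overlap metric.
def step_reward(executed: bool, is_new_query: bool,
                progress: float, prev_progress: float) -> float:
    """Dense per-step reward: operational feedback plus shaped progress.

    `progress` acts as a potential Phi(s) in [0, 1] measuring closeness of
    the current results to the gold answer. Rewarding the delta
    Phi(s') - Phi(s) is potential-based shaping (Ng et al., 1999), so it
    cannot change the optimal policy.
    """
    r = 0.02 if executed else 0.0          # did the SQL run without error?
    r += 0.01 if is_new_query else -0.01   # novelty bonus / repeat penalty
    r += progress - prev_progress          # delta-from-previous-step shaping
    return r

def terminal_reward(correct: bool) -> float:
    return 1.0 if correct else 0.0
```

Because the progress term is a delta, it telescopes over an episode: an agent that wanders and comes back to the same state collects zero net progress reward.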

Training with GRPO

We train using Group Relative Policy Optimization (GRPO), which compares multiple rollouts of the same question to compute advantages. TRL's environment_factory runs the agent through SQLEnv for each rollout, accumulating rewards across the multi-turn tool-calling loop.

Training uses a two-phase curriculum:

  • Phase 1: Easy questions only (single-table), with KL penalty (β=0.04) to stay close to the SFT-initialized policy
  • Phase 2: Easy + medium questions (multi-table JOINs), KL penalty removed to allow exploration
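As a sketch, the curriculum can be expressed as plain config dicts plus a question filter. The keys here are illustrative, not TRL's actual GRPOConfig fields.

```python
# Illustrative two-phase curriculum config; keys are stand-ins, not TRL fields.
PHASES = [
    {"name": "phase1", "difficulties": ["easy"],           "beta": 0.04},
    {"name": "phase2", "difficulties": ["easy", "medium"], "beta": 0.0},
]

def select_questions(questions, phase):
    """Filter the question pool to the current phase's difficulty mix."""
    return [q for q in questions if q["difficulty"] in phase["difficulties"]]
```

Phase 1 keeps the KL penalty on so the policy stays near its SFT initialization; Phase 2 drops it and widens the question pool.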

The SFT warmup matters more than we expected. We generate 120 multi-turn trajectories showing the full describe→query→answer workflow and train on them with assistant_only_loss, so the model learns from its own actions, not the environment's responses. Without this, GRPO has no coherent policy to improve — the agent never learns to call tools in sequence.
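One such trajectory, in chat format, might look like the sketch below. The JSON tool-call strings and role layout are illustrative assumptions; the point is the loss mask: gradient flows only through assistant turns.

```python
# Illustrative multi-turn SFT trajectory; loss applies only to assistant turns,
# in the spirit of assistant_only_loss. Message contents are made up for the sketch.
trajectory = [
    {"role": "user",
     "content": "What is the total bonus given in all evaluations? Tables: evaluation"},
    {"role": "assistant",
     "content": '{"tool": "describe", "table": "evaluation"}'},
    {"role": "tool",
     "content": "Employee_ID INT, Year_awarded INT, Bonus REAL"},
    {"role": "assistant",
     "content": '{"tool": "query", "sql": "SELECT SUM(Bonus) FROM evaluation"}'},
    {"role": "tool",
     "content": "19500.0"},
    {"role": "assistant",
     "content": '{"tool": "answer", "value": "19500.0"}'},
]

def loss_mask(messages):
    """1 where token loss applies (assistant turns), 0 elsewhere."""
    return [1 if m["role"] == "assistant" else 0 for m in messages]
```

Training on whole trajectories like this, rather than isolated turns, is what teaches the model when to stop exploring and answer.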

What We Observed

Across nine training runs on Qwen3-0.6B, three findings stand out.

The environment produces clear learning signal. Phase 1 reward climbs from near-zero to 0.5–0.7 within 400 steps. The model learns to describe tables before querying, format answers correctly (comma-separated lists, pipe-delimited rows), recover from SQL errors by re-describing tables, and submit [] for genuinely empty results. These are strategic behaviors that emerge from reward signal, not from hard-coded rules.

The model hits a capacity ceiling on medium questions. Multi-table JOIN queries — the kind requiring foreign-key chain reasoning across 3+ tables — don't improve with more training. We ran extended training (v2: 4 total epochs across both phases) and the reward curve stayed flat. The dominant failure mode is column name hallucination: the model reads the schema correctly via DESCRIBE, then writes SQL using pretrained column names that don't match. A 0.6B model can't override pretraining biases through RL reward signal alone.

Thinking mode helps error recovery but doesn't raise the ceiling. With Qwen3's thinking mode enabled, the model reasons through SQL errors in `<think>` blocks — noticing column name mismatches, adjusting join paths. Easy-question accuracy stays the same, but error recovery becomes more deliberate. A new failure mode emerges: ~23% of rollouts degenerate into unclosed `<think>` loops that consume the entire token budget. The fix is straightforward (add SFT examples with proper think blocks), but it reveals how small models struggle with structural token management.
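Detecting the degenerate case is simple: a rollout that opens a think block and never closes it. A minimal check, matching how we'd flag such rollouts (the function is a sketch, not the project's actual code):

```python
# Flag rollouts that open a <think> block without closing it.
def has_unclosed_think(text: str) -> bool:
    return text.count("<think>") > text.count("</think>")
```
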

Training metrics (Run 9, Qwen3-0.6B)

| Phase | Steps | Questions | Reward range | Mean reward |
| --- | --- | --- | --- | --- |
| Phase 1 (easy, β=0.04) | 870 | 435 | 0.01–1.15 | ~0.5 |
| Phase 2 (easy+medium, β=0.0) | 934 | 467 | 0.01–1.15 | ~0.5 |

Parse rate: >98% (model produces valid tool-call JSON). The 10% eval accuracy on the GRPO checkpoint vs 0% on the base model confirms the environment produces genuine learning, even if the absolute numbers are modest for a 0.6B model.

Technical Highlights

  • 10 Spider databases with structured metadata and a deterministic train/eval split
  • Typed action and observation models make every environment interaction explicit and debuggable
  • Read-only SQL execution via SQLite mode=ro — safety enforced by the database engine, not regex
  • Potential-based reward shaping (Ng et al., 1999) — delta progress rewards provably preserve the optimal policy
  • TRL environment_factory integration — the environment plugs into standard GRPO training with no custom training loop
  • Docker packaging for HuggingFace Spaces with bundled databases and health checks
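The read-only execution point deserves a concrete illustration: SQLite's mode=ro URI flag makes the database engine itself reject writes, so no regex filtering of SQL is needed. This sketch uses only the standard sqlite3 module; the file path is made up for the demo.

```python
# SQLite read-only enforcement via mode=ro: the engine rejects writes itself.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
rw = sqlite3.connect(path)                 # writable connection to seed data
rw.execute("CREATE TABLE t (x INT)")
rw.execute("INSERT INTO t VALUES (1)")
rw.commit()
rw.close()

ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)   # read-only connection
rows = ro.execute("SELECT x FROM t").fetchall()          # reads succeed
try:
    ro.execute("INSERT INTO t VALUES (2)")               # writes are rejected
    blocked = False
except sqlite3.OperationalError:
    blocked = True
```

Because the rejection happens inside SQLite, it covers every write path — INSERT, UPDATE, DROP, PRAGMA writes — without maintaining a blocklist.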

Future Directions

Two clear paths forward:

Larger models. The 0.6B model's ceiling comes from pretrained column-name biases overriding schema context. A 1.7B or larger model has more capacity to attend to the DESCRIBE output and override pretraining. The environment and reward architecture are model-agnostic — scaling up is a config change, not a redesign.

Thinking mode with targeted SFT. Qwen3's thinking mode shows promise for multi-step reasoning (error recovery, join path planning) but needs SFT coverage on proper `<think>...</think>` blocks to prevent degenerate loops. Combining a larger model with thinking mode should push into medium-difficulty multi-table questions where the current model plateaus.

The environment itself is the contribution. Whether the agent is 0.6B or 70B, sparse reward or dense, GRPO or PPO — the action space, reward architecture, and episode structure remain the same. SQLEnv provides the training ground. The models will catch up.

Try It Yourself

  • HuggingFace Space: [live demo link]
  • Training notebook: notebooks/train_grpo.ipynb — runs on Colab L4 in ~7 hours for both phases
  • GitHub: full source, architecture docs, and verification artifacts

What We Learned

Dense intermediate rewards accelerate learning only when they align with the final objective. Potential-based shaping (Ng et al., 1999) gives us this guarantee — delta progress rewards can't distort the optimal policy. Without this property, agents learn to farm exploration rewards instead of answering questions.

Tool-using agents benefit from transparent errors. When the environment surfaces "Error: no such column: full_name" instead of silently returning empty results, the agent develops error-recovery strategies. Better diagnostics produce better policy updates.

Multi-turn SFT is the foundation, not a warmup step. Without full-trajectory SFT (describe→query→answer as one example with assistant_only_loss), GRPO has no coherent starting policy to improve. Per-turn SFT teaches individual actions; multi-turn SFT teaches strategy. The difference is the difference between a model that calls describe forever and one that knows when to answer.