<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>SQLEnv Blog Post Preview</title>
<style>
:root { --bg: #0d1117; --surface: #161b22; --card: #1a1a2e; --text: #e0e0e0; --muted: #8b949e; --accent: #58a6ff; --border: #30363d; }
* { margin: 0; padding: 0; box-sizing: border-box; }
body { background: var(--bg); color: var(--text); font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif; line-height: 1.7; max-width: 820px; margin: 0 auto; padding: 40px 24px 80px; }
h1 { font-size: 2em; margin: 0 0 32px; color: #fff; line-height: 1.2; }
h2 { font-size: 1.5em; margin: 48px 0 16px; color: #fff; border-bottom: 1px solid var(--border); padding-bottom: 8px; }
h3 { font-size: 1.15em; margin: 32px 0 12px; color: #fff; }
p { margin: 0 0 16px; }
em { color: var(--muted); }
strong { color: #fff; }
a { color: var(--accent); text-decoration: none; }
a:hover { text-decoration: underline; }
code { background: var(--surface); padding: 2px 6px; border-radius: 4px; font-size: 0.9em; font-family: 'SF Mono', 'Fira Code', monospace; }
ul, ol { margin: 0 0 16px 24px; }
li { margin-bottom: 6px; }
li code { font-size: 0.85em; }
table { width: 100%; border-collapse: collapse; margin: 16px 0; }
th { text-align: left; padding: 10px 12px; background: var(--surface); border: 1px solid var(--border); color: #fff; font-size: 0.9em; }
td { padding: 10px 12px; border: 1px solid var(--border); font-size: 0.9em; }
tr.row-grpo { background: rgba(76, 175, 80, 0.1); }
tr.row-grpo td { color: #fff; font-weight: 600; }
tr.row-grpo td:nth-child(2) { color: #81c784; }
tr.row-base td:nth-child(2) { color: #e57373; }
img { max-width: 100%; border-radius: 8px; margin: 16px 0; }
figcaption, .caption { font-size: 0.85em; color: var(--muted); font-style: italic; margin-top: -8px; margin-bottom: 16px; }
/* Copyable code block */
.code-block { position: relative; margin: 16px 0; }
.code-block pre { background: var(--surface); border: 1px solid var(--border); border-radius: 8px; padding: 16px; padding-right: 48px; overflow-x: auto; font-size: 13px; line-height: 1.5; font-family: 'SF Mono', 'Fira Code', monospace; color: #c9d1d9; margin: 0; }
.code-block.scrollable pre { font-size: 12px; }
.reward-comment { color: #81c784; }
.copy-btn { position: absolute; top: 8px; right: 8px; background: var(--border); border: none; border-radius: 4px; padding: 4px 8px; cursor: pointer; color: var(--muted); font-size: 12px; font-family: inherit; transition: background 0.15s, color 0.15s; z-index: 1; }
.copy-btn:hover { background: var(--muted); color: var(--bg); }
.copy-btn.copied { background: #4caf50; color: #fff; }
/* Cards */
.cards { display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap; }
.card { flex: 1; min-width: 240px; background: var(--card); border-radius: 12px; padding: 24px 20px; display: flex; flex-direction: column; }
.card-header { text-align: center; margin-bottom: 14px; }
.card-header .emoji { font-size: 32px; display: block; margin-bottom: 6px; }
.card-header .title { font-weight: bold; font-size: 1.05em; }
.card-question { color: var(--muted); font-size: 13px; margin-bottom: 14px; text-align: center; height: 36px; display: flex; align-items: center; justify-content: center; }
.card-code { background: var(--bg); border: none; border-radius: 8px; padding: 10px 12px; font-size: 10.5px; margin: 0; line-height: 1.45; font-family: 'SF Mono', 'Fira Code', monospace; color: #c9d1d9; overflow-x: auto; height: 130px; white-space: pre; }
.card-outcome { margin-top: 12px; font-size: 13px; font-weight: 600; height: 24px; display: flex; align-items: center; }
.card-insight { font-size: 12px; color: var(--muted); margin-top: 8px; line-height: 1.5; border-top: 1px solid var(--border); padding-top: 8px; height: 40px; }
.green { color: #4caf50; }
.red { color: #ef5350; }
.blue { color: #4fc3f7; }
.orange { color: #ffb74d; }
.neg { color: #e57373; opacity: 0.85; }
.pos { color: #81c784; opacity: 0.85; }
.marker { font-size: 0.85em; font-weight: 600; }
/* Cold open side-by-side */
.cold-open { display: flex; gap: 16px; margin: 24px 0; flex-wrap: wrap; }
.cold-open-panel { flex: 1; min-width: 340px; background: var(--surface); border-radius: 12px; padding: 20px; border: 1px solid var(--border); }
.cold-open-panel h3 { margin: 0 0 4px; font-size: 1em; border: none; padding: 0; color: #fff; }
.cold-open-panel .subtitle { color: var(--muted); font-size: 13px; margin-bottom: 12px; }
.cold-open-panel pre { margin: 0; padding: 12px; font-size: 10.5px; border: none; background: var(--bg); border-radius: 8px; height: 200px; overflow-y: auto; }
.cold-open-panel .verdict { margin-top: 12px; font-size: 13px; }
.cold-open-panel .verdict .red { color: #e57373; }
.cold-open-panel .verdict .green { color: #81c784; }
/* Color legend */
.legend { display: flex; gap: 16px; margin: 8px 0 4px; font-size: 12px; color: var(--muted); flex-wrap: wrap; }
.legend-item { display: flex; align-items: center; gap: 4px; }
.legend-swatch { width: 10px; height: 10px; border-radius: 2px; display: inline-block; }
</style>
</head>
<body>
<h1>SQLEnv: Teaching Small Models to Explore Databases Like Analysts</h1>
<h2>Untrained vs Trained Agent</h2>
<div class="cold-open">
<div class="cold-open-panel">
<h3>Untrained agent</h3>
<div class="subtitle"><em>"Count the number of paragraphs."</em></div>
<pre><span class="neg">(-)</span> DESCRIBE Documents ×5 ← same table, five times
<span class="neg">(-)</span> SAMPLE Documents ×3 ← already saw these rows
<span class="neg">(-)</span> DESCRIBE Templates ← wrong table
<span class="pos">(+)</span> DESCRIBE Paragraphs ← finally the right table
<span class="neg">(-)</span> QUERY SELECT * LIMIT 5 ← no aggregation
<span class="neg">(-)</span> QUERY SELECT * LIMIT 5 ← still no COUNT(*)
<span class="neg">(-)</span> ANSWER "76 | Robbin CV" ← a random row</pre>
<div class="verdict"><span class="red">15 steps · reward 0.28 · never wrote SELECT COUNT(*)</span></div>
</div>
<div class="cold-open-panel">
<h3>Trained agent</h3>
<div class="subtitle"><em>"Which employee received the biggest bonus?"</em></div>
<pre><span class="pos">(+)</span> DESCRIBE employee
→ Employee_ID, Name, Age, City
<span class="neg">(-)</span> QUERY ...ORDER BY Salary DESC
→ Error: no such column: Salary
<span class="pos">(+)</span> DESCRIBE evaluation
→ Employee_ID, Year_awarded, Bonus
<span class="pos">(+)</span> QUERY ...JOIN...ORDER BY Bonus DESC
→ Louis Deacon
<span class="pos">(+)</span> ANSWER "Louis Deacon"</pre>
<div class="verdict"><span class="green">5 steps · reward 1.13 · recovered from error</span></div>
</div>
</div>
<p>Both agents have the same four tools, the same 15-step budget, and the same databases. Different questions are shown to illustrate the range of behaviors; both agents were evaluated on the same 50-question set in the quantitative comparison below. The untrained agent wastes most of its steps without making progress. The trained agent first explores the schema, then hits an error, adapts, and solves a harder question in a third of the steps.</p>
<h2>The Gap</h2>
<p>Standard Spider-style text-to-SQL (<a href="https://yale-lily.github.io/spider">Yu et al., 2018</a>) gives the model the full schema up front and scores the resulting SQL with exact-match, execution, or test-suite accuracy. This setup rewards memorization. The model never learns to explore a schema or iterate toward a solution, so it struggles on unfamiliar databases with many tables where the full schema cannot fit in context.</p>
<p>SQLEnv takes a different approach. The agent progressively discovers the schema through its own actions: it starts with only table names and must call DESCRIBE, SAMPLE, and QUERY to reveal columns, types, and relationships within a fixed step budget. This is a POMDP (partially observable Markov decision process) where the agent acts under uncertainty, which makes exploration necessary and learnable.</p>
<h2>What Analysts Actually Do</h2>
<p>Consider the situation where you need to answer a question using data in an unfamiliar database. You probably cannot write the final query in one go. Instead, you run <code>DESCRIBE</code> to see what columns exist, <code>SELECT * LIMIT 5</code> to scan the actual data, then build your query piece by piece, adjusting joins, fixing column names, and retrying after errors. The answer emerges from iteration.</p>
<p>SQLEnv captures this workflow. Four actions mirror what analysts do:</p>
<ul>
<li><strong>DESCRIBE</strong> reveals column names and types for a table</li>
<li><strong>SAMPLE</strong> previews rows to understand the data (available but rarely used by the trained agent, which learned to rely on DESCRIBE and QUERY)</li>
<li><strong>QUERY</strong> executes a read-only SQL query</li>
<li><strong>ANSWER</strong> submits a final answer</li>
</ul>
<p>Each episode starts with a natural-language question and a list of table names. Columns, types, and relationships stay hidden until the agent discovers them through exploration. This partial observability forces strategy over pattern-matching.</p>
<p>A clean episode on the question <em>"List student IDs with registered courses and their course counts"</em>:</p>
<div class="code-block scrollable">
<button class="copy-btn" onclick="copyBlock(this)">Copy</button>
<pre>Step 1: DESCRIBE student_course_registrations
→ student_id INTEGER, course_id INTEGER, registration_date DATETIME
→ reward: +0.015 <span class="reward-comment">← new schema revealed</span>
Step 2: DESCRIBE students
→ student_id INTEGER, student_details VARCHAR(255)
→ reward: +0.015 <span class="reward-comment">← second table described</span>
Step 3: QUERY SELECT T1.student_id, count(*)
FROM students AS T1
JOIN student_course_registrations AS T2
ON T1.student_id = T2.student_id
GROUP BY T1.student_id
→ 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
→ reward: +0.150 <span class="reward-comment">← results overlap with gold answer</span>
Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
→ correct
→ reward: +1.000 <span class="reward-comment">← terminal: correct answer</span></pre>
</div>
<p>Total reward: 1.180. Four steps. Exploration: 0.180, terminal: 1.000.</p>
<h2>Built on OpenEnv</h2>
<p><a href="https://github.com/meta-pytorch/OpenEnv">OpenEnv</a> provides a Gymnasium-style interface for agentic RL environments. The contract is simple:</p>
<ul>
<li><code>reset(seed)</code> starts a new episode and returns the initial observation</li>
<li><code>step(action)</code> executes one action and returns observation, reward, and done flag</li>
</ul>
<p>Pydantic models enforce typed contracts between agent and environment. We tested integration with TRL's <code>environment_factory</code> for GRPO training. OpenEnv is also designed to work with torchforge and Unsloth, though we have not tested those integrations. SQLEnv implements the interface with four actions (DESCRIBE, SAMPLE, QUERY, ANSWER):</p>
<div class="legend">
<div class="legend-item"><span class="legend-swatch" style="background:#d2a8ff;"></span> method</div>
<div class="legend-item"><span class="legend-swatch" style="background:#7ee787;"></span> action type</div>
<div class="legend-item"><span class="legend-swatch" style="background:#ffa657;"></span> argument</div>
</div>
<div class="code-block">
<button class="copy-btn" onclick="copyBlock(this)">Copy</button>
<pre style="margin-top:0">env = SQLEnvironment(questions_path=<span style="color:#a5d6ff">"..."</span>, db_dir=<span style="color:#a5d6ff">"..."</span>, tokenizer=tok)
obs = env.<span style="color:#d2a8ff">reset</span>(seed=42)
obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span style="color:#7ee787">"DESCRIBE"</span>, argument=<span style="color:#ffa657">"employee"</span>))
obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span style="color:#7ee787">"QUERY"</span>, argument=<span style="color:#ffa657">"SELECT COUNT(*) FROM employee"</span>))
obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span style="color:#7ee787">"ANSWER"</span>, argument=<span style="color:#ffa657">"10"</span>))</pre>
</div>
<p>TRL's <code>environment_factory</code> auto-discovers the four tool methods from the environment class for GRPO training. The same environment runs locally, in Docker on HuggingFace Spaces, or over WebSocket.</p>
<p>The Green Agent evaluator wraps this protocol for benchmarking:</p>
<div class="code-block">
<pre>evaluate(env, policy, n_episodes=50, seed=0)</pre>
</div>
<p>This runs any <code>Policy</code> through the environment and reports success rate, average reward, and step count. Built-in <code>RandomPolicy</code> and <code>OraclePolicy</code> baselines provide lower and upper bounds (0% vs 100% accuracy, 0.25 vs 1.17 reward).</p>
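<p>A minimal sketch of that harness, assuming the observation bundles <code>reward</code>, <code>done</code>, and a <code>correct</code> flag (these attribute names are assumptions; the real protocol may differ):</p>

```python
from typing import Any, Protocol

class Policy(Protocol):
    """Anything with an act() method can be evaluated (illustrative protocol)."""
    def act(self, observation: Any) -> Any: ...

def evaluate(env, policy, n_episodes: int = 50, seed: int = 0) -> dict:
    """Roll a policy through the environment and aggregate the reported metrics."""
    correct, total_reward, total_steps = 0, 0.0, 0
    for ep in range(n_episodes):
        obs = env.reset(seed=seed + ep)
        while not obs.done:
            obs = env.step(policy.act(obs))
            total_reward += obs.reward
            total_steps += 1
        correct += int(obs.correct)  # assumed terminal flag on the final observation
    return {
        "success_rate": correct / n_episodes,
        "avg_reward": total_reward / n_episodes,
        "avg_steps": total_steps / n_episodes,
    }
```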
<h2>Reward Architecture</h2>
<p>Three layers of reward signal:</p>
<table>
<tr><th>Layer</th><th>Signal</th><th>Per-step clip</th></tr>
<tr>
<td><strong>L1: Operational</strong></td>
<td>Successful execution <span class="pos">(+0.02)</span>, new info <span class="pos">(+0.01)</span>, repeat <span class="neg">(-0.03)</span>, step cost <span class="neg">(-0.02)</span></td>
<td style="white-space:nowrap">[-0.10, 0.15]</td>
</tr>
<tr>
<td><strong>L2: Progress</strong></td>
<td>Delta from previous query result — cardinality, value overlap, numeric proximity. Positive <span class="pos">(+)</span> for improvement, negative <span class="neg">(-)</span> for regression.</td>
<td style="white-space:nowrap">[-0.10, 0.15]</td>
</tr>
<tr>
<td><strong>L3: Terminal</strong></td>
<td>Correct answer: <span class="pos">+1.0</span>. Wrong: <span class="neg">0.0</span></td>
<td style="white-space:nowrap">one-shot</td>
</tr>
</table>
<p>Terminal correctness dominates. Maximum exploration reward across 15 steps is ~0.3, while a correct answer adds 1.0. An agent that explores but never answers always scores below one that answers correctly. Prior work on tool-using agents suggests that dense intermediate rewards are important for training small models (see TIPS, ToolRL, StepTool below). We did not ablate this by testing terminal-only reward at 0.6B parameters, but the progressive reward signal let us verify that the agent was learning the right strategic patterns: reward climbed from -0.1 to 0.5-0.7 as the agent shifted from random tool calls to describe-then-query-then-answer sequences.</p>
<p>The L2 progress component is inspired by potential-based reward shaping (<a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf">Ng et al., 1999</a>). The raw delta follows the potential-difference form (phi(s') - phi(s)), but the combined reward includes binning and per-step clipping for training stability, which departs from the strict theoretical guarantee. We confirmed empirically that the potential-difference structure matters: cumulative progress caps (not potential-based) caused the agent to explore endlessly and never answer.</p>
<p>Note that the L2 progress reward uses the gold answer (executed gold SQL rows) as a training-time verifier signal. This supervision is computed server-side and never visible to the agent, but it means the dense shaping relies on ground truth that is not available at inference.</p>
<p>Recent work supports dense shaping for tool-using agents. <a href="https://arxiv.org/abs/2603.22293">TIPS</a> (2026) reported 11.8% EM gains over PPO baselines in search-augmented multi-turn QA. <a href="https://arxiv.org/html/2504.13958v1">ToolRL</a> (2025) found 17% improvement over base models and 15% over SFT models through principled reward design for tool learning. <a href="https://arxiv.org/abs/2410.07745">StepTool</a> (2024) found step-grained shaping outperformed outcome-only rewards in tool-use benchmarks.</p>
<h2>Training</h2>
<p>We train <a href="https://arxiv.org/abs/2505.09388">Qwen3-0.6B</a> with Group Relative Policy Optimization (<a href="https://arxiv.org/abs/2402.03300">GRPO</a>, from DeepSeekMath). TRL's <code>environment_factory</code> runs the agent through SQLEnv for each rollout, comparing multiple rollouts per question to compute advantages.</p>
<p>SFT warmup proved critical. Per-turn SFT (347 examples, each one assistant turn) taught the model to call describe forever. Half the training examples were describe calls, so the model learned "when asked a question, call describe." When we applied a KL penalty during RL, every rollout stayed identical to this SFT behavior. The advantage between rollouts was zero, so no policy gradient could form.</p>
<p>Multi-turn SFT (120 full trajectories with <code>assistant_only_loss</code>) taught describe-then-query-then-answer as a coherent strategy. The subsequent GRPO training refined this into error recovery, answer formatting, and knowing when to stop exploring.</p>
<p>Two-phase curriculum:</p>
<ul>
<li><strong>Phase 1</strong>: Easy questions (single-table), KL penalty (beta=0.04) to keep the policy close to SFT initialization while allowing exploration. Reward climbs from -0.1 to 0.5-0.7 over 400 steps.</li>
<li><strong>Phase 2</strong>: Easy + medium (multi-table JOINs), KL removed (beta=0) so the agent can deviate further from SFT and discover new strategies. Reward holds at ~0.5.</li>
</ul>
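<p>Assuming TRL is installed, the two phases differ mainly in the KL coefficient, which <code>GRPOConfig</code> exposes as <code>beta</code>; the other settings here are illustrative, not the notebook's exact configuration:</p>

```python
from trl import GRPOConfig  # assumes TRL is installed

# Phase 1: anchor the policy near its SFT initialization on easy questions.
phase1 = GRPOConfig(output_dir="grpo-phase1", beta=0.04, num_generations=8)

# Phase 2: remove the KL anchor so the policy can drift from the SFT prior
# while training on easy + medium (multi-table JOIN) questions.
phase2 = GRPOConfig(output_dir="grpo-phase2", beta=0.0, num_generations=8)
```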
<div style="text-align: center; margin: 24px 0;">
<img src="/Users/hjerp/Projects/sql-env/docs/rl-training-phase-1.png" alt="GRPO Training Progress" style="max-width: 100%; border-radius: 8px;">
<p class="caption" style="text-align: center;">Batch-mean reward across Phase 1 (easy, beta=0.04) and Phase 2 (easy+medium, beta=0). Reward starts negative and climbs to 0.5-0.7 as the agent learns describe-then-query-then-answer. Phase 2 holds at ~0.5. Individual episodes that solve the question correctly receive up to 1.15 total reward, but the batch mean is lower because most rollouts within a batch include incorrect attempts. 902 steps, ~4.75h on Colab L4.</p>
</div>
<p>SFT warmup takes ~1 minute (60 steps, loss drops from 1.6 to 0.08 in 2 epochs). GRPO Phase 1 runs ~2.25h, Phase 2 ~2.5h. Total pipeline: ~5 hours on a single Colab L4 (24GB VRAM), in one notebook session.</p>
<h2>What the Agent Learned</h2>
<p>The following behaviors emerged during training:</p>
<div class="cards">
<div class="card">
<div class="card-header">
<span class="emoji">🔎</span>
<span class="title blue">Schema Discovery</span>
</div>
<div class="card-question"><em>"What is the total bonus in all evaluations?"</em></div>
<div class="card-code">describe("evaluation")
→ Employee_ID, Year_awarded, Bonus
query("SELECT SUM(Bonus) FROM evaluation")
→ 19500.0
answer("19500.0")</div>
<div class="card-outcome"><span class="green">3 steps · reward 1.15 · correct</span></div>
<div class="card-insight">Aggregation on one table. Describe first, then write targeted SQL.</div>
</div>
<div class="card">
<div class="card-header">
<span class="emoji">🔧</span>
<span class="title orange">Error Recovery</span>
</div>
<div class="card-question"><em>"Which employee received the biggest bonus?"</em></div>
<div class="card-code">describe("employee") → Name, Age
query("...ORDER BY Salary DESC")
→ Error: no such column: Salary
describe("evaluation") → Bonus
query("...JOIN...ORDER BY Bonus DESC")
→ Louis Deacon
answer("Louis Deacon")</div>
<div class="card-outcome"><span class="green">5 steps · reward 1.13 · correct</span></div>
<div class="card-insight">Two-table JOIN with error recovery. Wrong column, re-describe, retry.</div>
</div>
<div class="card">
<div class="card-header">
<span class="emoji">🚧</span>
<span class="title red">The Ceiling</span>
</div>
<div class="card-question"><em>"Which city has the most arriving flights?"</em></div>
<div class="card-code">describe("AIRPORTS")
→ City, AirportCode
query("SELECT CITY, COUNT(*)
FROM AIRPORTS GROUP BY CITY
ORDER BY COUNT(*) DESC LIMIT 1")
→ Albany | 4
answer("Albany")</div>
<div class="card-outcome"><span class="red">3 steps · reward 0.0 · incorrect</span></div>
<div class="card-insight">Multi-hop JOIN (3+ tables). Counted airports, not flights. Beyond 0.6B.</div>
</div>
</div>
<p>The first two cards show learned behaviors: schema-first exploration and error recovery. The third shows where 0.6B saturates. We expand on these limitations and next steps below.</p>
<h3>Evaluation (N=50 episodes, 2 independent runs)</h3>
<p>All conditions run through SQLEnv's Green Agent evaluator: <code>evaluate(env, policy, n_episodes, seed)</code>.</p>
<table>
<tr><th>Method</th><th>Accuracy</th><th>Parse Rate</th><th>Avg Steps</th></tr>
<tr class="row-base"><td>Zero-shot</td><td>0%</td><td>24-28%</td><td>10.8-12.4</td></tr>
<tr class="row-base"><td>1-shot</td><td>0-2%</td><td>16-17%</td><td>14.0-14.8</td></tr>
<tr class="row-base"><td>3-shot</td><td>0%</td><td>19-20%</td><td>13.8-14.8</td></tr>
<tr class="row-grpo"><td>GRPO v1 (2 epochs)</td><td>28-30%</td><td>95-100%</td><td>3.5-4.0</td></tr>
<tr class="row-grpo"><td>GRPO v2 (4 epochs)</td><td>24-32%</td><td>87-95%</td><td>3.5-4.0</td></tr>
</table>
<p><strong>Parse rate</strong>: the trained model (v1) produces valid tool-call JSON 95-100% of the time. The base model fails to produce a valid tool call 72-84% of the time across prompting conditions and burns its step budget repeating malformed output. <strong>Accuracy</strong>: the base model cannot answer a single question even with 3 examples, but the trained model solves 14-15 out of 50 on this curated Spider subset.</p>
<p>We trained two GRPO checkpoints: v1 (2 epochs) and v2 (4 epochs). Across two evaluation runs each, v1 scored 28-30% accuracy with 95-100% parse rate, while v2 scored 24-32% with 87-95% parse rate. The run-to-run variation (6-8 percentage points) at N=50 makes checkpoint-to-checkpoint differences hard to interpret. Extended training also introduced an abstention pattern: v2 sometimes outputs "Task complete" instead of calling answer() on uncertain questions, which increases parse failures but may reflect learned caution. On this subset, additional RL training did not improve accuracy, which indicates that the bottleneck is the model's 0.6B pretraining rather than the training budget.</p>
<h2>Limitations at 0.6B Parameters</h2>
<p>On this curated 10-database Spider subset, three failure modes define the current ceiling:</p>
<ul>
<li><strong>Column name hallucination.</strong> The model reads <code>FullName</code> from DESCRIBE but writes <code>full_name</code> in SQL, or reads <code>Horsepower</code> and writes <code>HorsepowerDESC</code> (missing space). Pretraining biases override the schema that the model just observed in context.</li>
<li><strong>FK chain reasoning.</strong> The model handles single-table queries well but fails on three-table JOINs such as Documents → Templates → Ref_Template_Types. It cannot chain foreign keys through intermediate tables.</li>
<li><strong>More RL does not help.</strong> Extended training (v2: 4 total epochs) did not improve accuracy (24-32% vs 28-30% for v1, within run-to-run noise). This indicates the ceiling comes from pretraining knowledge rather than training budget.</li>
</ul>
<p>On this subset, RL drives accuracy from 0% to ~30% but saturates at 0.6B capacity. The eval set is easy-heavy (91% single/two-table questions, no hard questions) and uses N=50 episodes per condition, so these results should not be generalized beyond this setup. Train and eval use mostly separate databases, with one schema (flight_2) appearing in both. We did not test whether SFT on multi-table reasoning or structured thinking before JOINs could push past this ceiling; we discuss these directions under Next Steps.</p>
<p>This evaluation is not comparable to the official Spider leaderboard, which uses different scoring (test-suite accuracy), full-schema input, and a broader database set.</p>
<h2>The Learning Space</h2>
<p>The Green Agent evaluator brackets the learning space with two baselines:</p>
<table>
<tr><th>Policy</th><th>Accuracy</th><th>Avg Reward</th><th>Avg Steps</th></tr>
<tr class="row-base"><td>Random</td><td>0%</td><td>0.247</td><td>15.0</td></tr>
<tr class="row-grpo"><td>GRPO (trained)</td><td>~30%</td><td>~0.35</td><td>3.5</td></tr>
<tr><td>Oracle</td><td>100%</td><td>1.168</td><td>3.5</td></tr>
</table>
<p>Random scores 0.247 by accumulating small exploration rewards across 15 steps without answering. Oracle scores 1.168 in 3.5 steps. This gap between 0.25 and 1.17 represents what a trained agent can learn. Our GRPO agent lands at ~0.35, above random but far below oracle, with room for improvement through better SFT warmup or larger models.</p>
<h2>Technical Highlights</h2>
<ul>
<li><strong>676 questions</strong> (473 train / 203 eval) across 10 Spider databases with difficulty labels</li>
<li><strong>Typed models</strong> with Pydantic: every action, observation, and state is explicit and debuggable</li>
<li><strong>Read-only SQL</strong> via SQLite <code>mode=ro</code>, where the database engine enforces safety rather than regex</li>
<li><strong>Reward shaping</strong> inspired by potential-based methods (Ng et al., 1999), with practical modifications for training stability</li>
<li><strong>TRL environment_factory</strong> integration for standard GRPO training without a custom loop</li>
<li><strong>Green Agent evaluator</strong> with <code>Policy</code> protocol, <code>evaluate()</code> harness, and <code>RandomPolicy</code>/<code>OraclePolicy</code> baselines</li>
</ul>
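<p>The read-only guarantee above is enforced by the database engine itself rather than by inspecting the SQL text. A minimal sketch of SQLite's URI <code>mode=ro</code> (the table name is illustrative):</p>

```python
import os
import sqlite3
import tempfile

# Build a throwaway database file, then reopen it read-only via a URI.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE employee (Name TEXT)")
rw.commit()
rw.close()

ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT COUNT(*) FROM employee").fetchone()[0])  # 0

# Any write is rejected by the engine — no regex filtering of the query needed.
try:
    ro.execute("INSERT INTO employee VALUES ('x')")
except sqlite3.OperationalError as e:
    print(e)  # attempt to write a readonly database
```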
<h2>Next Steps</h2>
<p>The environment supports two directions for improvement:</p>
<p><strong>Thinking mode.</strong> The 30% ceiling comes from multi-table reasoning. The model cannot plan a three-table JOIN path before writing SQL. Qwen3's <code>&lt;think&gt;</code> blocks offer a way to reason about the join chain before writing the query. In our experiments, RL alone did not produce useful thinking: the model either emitted empty <code>&lt;think&gt;&lt;/think&gt;</code> blocks or collapsed into degenerate loops (<code>&lt;think&gt;assistant&lt;think&gt;assistant...</code>) that consumed ~23% of rollouts. Pure RL discovers that thinking tokens exist but not how to use them. SFT warmup with structured reasoning examples ("I need to join Documents → Templates → Ref_Template_Types through Template_ID") could bootstrap the format, then RL could refine when to think and when to skip. This is worth testing at 0.6B before concluding the ceiling requires a larger model.</p>
<p><strong>Larger models.</strong> Our goal is small models that run locally, so scaling to 7B or beyond changes the deployment story. That said, a 1.7B model has more capacity to attend to DESCRIBE output and override pretrained column names. The environment and reward architecture do not depend on model size, so scaling up requires changing the training configuration rather than redesigning the environment. At some point, larger models may solve these questions with few-shot prompting alone, but the environment remains useful for training small models that need to run without API access.</p>
<h2>Try It Yourself</h2>
<ul>
<li><strong>Training notebook</strong>: <code>notebooks/train_grpo.ipynb</code> runs the full SFT + GRPO pipeline on Colab L4 in ~5 hours</li>
<li><strong>Comparison notebook</strong>: <code>notebooks/compare_methods.ipynb</code> evaluates base vs trained models side by side</li>
<li><strong>Showcase notebook</strong>: <code>notebooks/showcase_sqlenv.ipynb</code> lets you explore the environment, run episodes, and see what tools and rewards are available</li>
<li><strong>GitHub</strong>: full source, architecture docs, and training artifacts</li>
</ul>
<h2>Discussion</h2>
<p><strong>The format of SFT data matters more than the quantity.</strong> Per-turn SFT (347 examples) taught the model individual tool calls but not when to use them. The model called describe repeatedly because half the training examples were describe calls. Multi-turn SFT (120 full trajectories) taught the model to chain describe, query, and answer into a coherent episode. The difference was not the number of examples but whether each example showed a complete problem-solving sequence.</p>
<p><strong>Transparent errors help the agent learn.</strong> When the environment returns <code>"Error: no such column: full_name"</code> instead of empty results, the agent can develop error-recovery strategies. Informative error messages give the RL training signal something to work with.</p>
<p><strong>Dense rewards benefit from theoretical grounding.</strong> Potential-based shaping (Ng et al., 1999) provides the theoretical foundation for our reward design, though our implementation includes practical modifications (binning, clipping) that depart from the strict potential-difference form. Without some form of dense shaping, we observed agents accumulating exploration rewards instead of answering questions. Recent work supports this direction: TIPS (2026) reported gains over PPO baselines in multi-turn QA, ToolRL (2025) found improvements through principled reward decomposition, and StepTool (2024) found step-grained shaping outperformed outcome-only rewards. These results suggest that principled reward design is important for multi-turn environments.</p>
<p><strong>The environment is the contribution.</strong> The action space, reward function, and episode structure do not depend on the choice of model or RL algorithm. SQLEnv targets small models that need to learn database exploration through training, since larger models can often handle these tasks with few-shot prompting alone. As newer small language models become available, the environment provides a training ground for teaching them iterative reasoning.</p>
<script>
function copyBlock(btn) {
const pre = btn.parentElement.querySelector('pre');
const text = pre.innerText;
navigator.clipboard.writeText(text).then(() => {
btn.textContent = 'Copied!';
btn.classList.add('copied');
setTimeout(() => {
btn.textContent = 'Copy';
btn.classList.remove('copied');
}, 2000);
});
}
</script>
</body>
</html>