hjerpe committed · verified
Commit 99fb2fb · 1 Parent(s): 9e64e71

Upload folder using huggingface_hub

README.md CHANGED
@@ -4,6 +4,7 @@ emoji: 🤖
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: docker
 
7
  pinned: false
8
  base_path: /web
9
  ---
 
4
  colorFrom: blue
5
  colorTo: green
6
  sdk: docker
7
+ app_port: 8000
8
  pinned: false
9
  base_path: /web
10
  ---
docs/blog-post-v1-preview.html CHANGED
@@ -112,11 +112,11 @@
112
 
113
  </div>
114
 
115
- <p>Both agents have the same four tools, the same 15-step budget, and the same databases. The untrained agent wastes most of its steps without making progress. The trained agent first explores the schema, then hits an error, adapts, and solves a harder question in a third of the steps.</p>
116
 
117
  <h2>The Gap</h2>
118
 
119
- <p>Standard text-to-SQL evaluation works like this: hand the model a complete schema (all tables, all columns, all relationships) and ask it to produce one SQL query. If the query matches the gold answer, it passes. This setup rewards memorization. The model never learns to explore a schema or iterate toward a solution, so it struggles on unfamiliar databases with many tables where the full schema cannot fit in context.</p>
120
 
121
  <p>SQLEnv takes a different approach. The agent progressively discovers the schema through its own actions: it starts with only table names and must call DESCRIBE, SAMPLE, and QUERY to reveal columns, types, and relationships within a fixed step budget. This is a POMDP (partially observable Markov decision process) where the agent acts under uncertainty, which makes exploration necessary and learnable.</p>
122
 
@@ -128,7 +128,7 @@
128
 
129
  <ul>
130
  <li><strong>DESCRIBE</strong> reveals column names and types for a table</li>
131
- <li><strong>SAMPLE</strong> previews rows to understand the data</li>
132
  <li><strong>QUERY</strong> executes a read-only SQL query</li>
133
  <li><strong>ANSWER</strong> submits a final answer</li>
134
  </ul>
@@ -164,14 +164,14 @@ Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
164
 
165
  <h2>Built on OpenEnv</h2>
166
 
167
- <p><a href="https://github.com/meta-pytorch/OpenEnv">OpenEnv</a> is a standard protocol for RL environments with a simple contract:</p>
168
 
169
  <ul>
170
  <li><code>reset(seed)</code> starts a new episode and returns the initial observation</li>
171
  <li><code>step(action)</code> executes one action and returns observation, reward, and done flag</li>
172
  </ul>
173
 
174
- <p>Pydantic models enforce typed contracts between agent and environment. Any environment that implements this protocol plugs into TRL, torchforge, and Unsloth without glue code. SQLEnv implements it with four actions (DESCRIBE, SAMPLE, QUERY, ANSWER):</p>
175
 
176
  <div class="legend">
177
  <div class="legend-item"><span class="legend-swatch" style="background:#d2a8ff;"></span> method</div>
@@ -223,11 +223,15 @@ obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span st
223
 
224
  <p>Terminal correctness dominates. Maximum exploration reward across 15 steps is ~0.3, while a correct answer adds 1.0. An agent that explores but never answers always scores below one that answers correctly. Prior work on tool-using agents suggests that dense intermediate rewards are important for training small models (see TIPS, ToolRL, StepTool below). We did not ablate this by testing terminal-only reward at 0.6B parameters, but the progressive reward signal let us verify that the agent was learning the right strategic patterns: reward climbed from -0.1 to 0.5-0.7 as the agent shifted from random tool calls to describe-then-query-then-answer sequences.</p>
225
 
226
- <p>The progress signal uses delta-from-previous-step: potential-based reward shaping (<a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf">Ng et al., 1999</a>). This preserves the optimal policy because the agent cannot game progress at the expense of correctness. We confirmed this empirically: cumulative progress caps (not potential-based) caused the agent to explore endlessly and never answer. Recent work validates this approach for tool-using agents. <a href="https://arxiv.org/abs/2603.22293">TIPS</a> (2026) showed potential-based turn-level shaping improved multi-turn agents by 11.8% over GRPO baselines. <a href="https://arxiv.org/html/2504.13958v1">ToolRL</a> (2025) found fine-grained reward decomposition improved tool learning by 17%. <a href="https://arxiv.org/abs/2410.07745">StepTool</a> (2024) confirmed step-grained shaping outperformed outcome-only rewards.</p>
 
 
 
 
227
 
228
  <h2>Training</h2>
229
 
230
- <p>We train Qwen3-0.6B with Group Relative Policy Optimization (GRPO). TRL's <code>environment_factory</code> runs the agent through SQLEnv for each rollout, comparing multiple rollouts per question to compute advantages.</p>
231
 
232
  <p>SFT warmup proved critical. Per-turn SFT (347 examples, each one assistant turn) taught the model to call describe forever. Half the training examples were describe calls, so the model learned "when asked a question, call describe." When we applied a KL penalty during RL, every rollout stayed identical to this SFT behavior. The advantage between rollouts was zero, so no policy gradient could form.</p>
233
 
@@ -241,7 +245,7 @@ obs = env.<span style="color:#d2a8ff">step</span>(SQLAction(action_type=<span st
241
 
242
  <div style="text-align: center; margin: 24px 0;">
243
  <img src="rl-training-phase-1.png" alt="GRPO Training Progress" style="max-width: 100%; border-radius: 8px;">
244
- <p class="caption" style="text-align: center;">GRPO reward across Phase 1 (easy, beta=0.04) and Phase 2 (easy+medium, beta=0). Reward starts negative and climbs to 0.5-0.7 in Phase 1 as the agent learns describe-then-query-then-answer. Phase 2 holds at ~0.5. Peaks at 1.15 mark correctly solved episodes. 902 steps, ~4.75h on Colab L4.</p>
245
  </div>
246
 
247
  <p>SFT warmup takes ~1 minute (60 steps, loss drops from 1.6 to 0.08 in 2 epochs). GRPO Phase 1 runs ~2.25h, Phase 2 ~2.5h. Total pipeline: ~5 hours on a single Colab L4 (24GB VRAM), in one notebook session.</p>
@@ -314,16 +318,17 @@ answer("Albany")</div>
314
  <tr class="row-base"><td>Zero-shot</td><td>0%</td><td>24-28%</td><td>10.8-12.4</td></tr>
315
  <tr class="row-base"><td>1-shot</td><td>0-2%</td><td>16-17%</td><td>14.0-14.8</td></tr>
316
  <tr class="row-base"><td>3-shot</td><td>0%</td><td>19-20%</td><td>13.8-14.8</td></tr>
317
- <tr class="row-grpo"><td>GRPO (trained)</td><td>~30%</td><td>95-100%</td><td>3.5-4.0</td></tr>
 
318
  </table>
319
 
320
- <p><strong>95-100% parse rate</strong>: the trained model produces valid tool-call JSON. The base model fails 76-83% of the time and burns its step budget repeating malformed output. <strong>~30% accuracy from 0%</strong>: the base model cannot answer a single question even with 3 examples, but the trained model solves 12-16 out of 50 in 3-4 steps.</p>
321
 
322
- <p>We trained two GRPO checkpoints: v1 (2 epochs) and v2 (4 epochs). Both scored ~30% accuracy across two evaluation runs. The variation between runs (6-8 percentage points) was larger than the difference between checkpoints, indicating that more training does not raise the ceiling. For RL alone, this indicates that the bottleneck is the model's 0.6B pretraining rather than the training budget.</p>
323
 
324
  <h2>Limitations at 0.6B Parameters</h2>
325
 
326
- <p>Three failure modes define the current ceiling:</p>
327
 
328
  <ul>
329
  <li><strong>Column name hallucination.</strong> The model reads <code>FullName</code> from DESCRIBE but writes <code>full_name</code> in SQL, or reads <code>Horsepower</code> and writes <code>HorsepowerDESC</code> (missing space). Pretraining biases override the schema that the model just observed in context.</li>
@@ -331,7 +336,9 @@ answer("Albany")</div>
331
  <li><strong>More RL does not help.</strong> Extended training (v2: 4 total epochs) produced identical accuracy. This indicates the ceiling comes from pretraining knowledge rather than training budget.</li>
332
  </ul>
333
 
334
- <p>RL drives accuracy from 0% to 30% but saturates at 0.6B capacity. We did not explore whether SFT on multi-table reasoning or structured thinking before JOINs could push past this ceiling in our current work. We discuss possible directions in the Next Steps section.</p>
 
 
335
 
336
  <h2>The Learning Space</h2>
337
 
@@ -352,7 +359,7 @@ answer("Albany")</div>
352
  <li><strong>676 questions</strong> (473 train / 203 eval) across 10 Spider databases with difficulty labels</li>
353
  <li><strong>Typed models</strong> with Pydantic: every action, observation, and state is explicit and debuggable</li>
354
  <li><strong>Read-only SQL</strong> via SQLite <code>mode=ro</code>, where the database engine enforces safety rather than regex</li>
355
- <li><strong>Potential-based reward shaping</strong> (Ng et al., 1999) that provably preserves the optimal policy</li>
356
  <li><strong>TRL environment_factory</strong> integration for standard GRPO training without a custom loop</li>
357
  <li><strong>Green Agent evaluator</strong> with <code>Policy</code> protocol, <code>evaluate()</code> harness, and <code>RandomPolicy</code>/<code>OraclePolicy</code> baselines</li>
358
  </ul>
@@ -380,7 +387,7 @@ answer("Albany")</div>
380
 
381
  <p><strong>Transparent errors help the agent learn.</strong> When the environment returns <code>"Error: no such column: full_name"</code> instead of empty results, the agent can develop error-recovery strategies. Informative error messages give the RL training signal something to work with.</p>
382
 
383
- <p><strong>Dense rewards benefit from theoretical grounding.</strong> Potential-based shaping (Ng et al., 1999) guarantees the agent optimizes for the right objective. Without it, we observed agents accumulating exploration rewards instead of answering questions. Recent work supports this for tool-using agents. TIPS (2026) showed 11.8% gains from potential-based turn-level shaping. ToolRL (2025) found 17% improvement from fine-grained reward decomposition. StepTool (2024) confirmed step-grained shaping outperformed outcome-only rewards. These results suggest that principled reward design is important for multi-turn environments.</p>
384
 
385
  <p><strong>The environment is the contribution.</strong> The action space, reward function, and episode structure do not depend on the choice of model or RL algorithm. SQLEnv targets small models that need to learn database exploration through training, since larger models can often handle these tasks with few-shot prompting alone. As newer small language models become available, the environment provides a training ground for teaching them iterative reasoning.</p>
386
 
 
112
 
113
  </div>
114
 
115
+ <p>Both agents have the same four tools, the same 15-step budget, and the same databases. Different questions are shown to illustrate the range of behaviors; both agents were evaluated on the same 50-question set in the quantitative comparison below. The untrained agent wastes most of its steps without making progress. The trained agent first explores the schema, then hits an error, adapts, and solves a harder question in a third of the steps.</p>
116
 
117
  <h2>The Gap</h2>
118
 
119
+ <p>Standard Spider-style text-to-SQL (<a href="https://yale-lily.github.io/spider">Yu et al., 2018</a>) gives the model the full schema up front and scores the resulting SQL with exact-match, execution, or test-suite accuracy. This setup rewards memorization. The model never learns to explore a schema or iterate toward a solution, so it struggles on unfamiliar databases with many tables where the full schema cannot fit in context.</p>
120
 
121
  <p>SQLEnv takes a different approach. The agent progressively discovers the schema through its own actions: it starts with only table names and must call DESCRIBE, SAMPLE, and QUERY to reveal columns, types, and relationships within a fixed step budget. This is a POMDP (partially observable Markov decision process) where the agent acts under uncertainty, which makes exploration necessary and learnable.</p>
122
 
 
128
 
129
  <ul>
130
  <li><strong>DESCRIBE</strong> reveals column names and types for a table</li>
131
+ <li><strong>SAMPLE</strong> previews rows to understand the data (available but rarely used by the trained agent, which learned to rely on DESCRIBE and QUERY)</li>
132
  <li><strong>QUERY</strong> executes a read-only SQL query</li>
133
  <li><strong>ANSWER</strong> submits a final answer</li>
134
  </ul>
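The first three tools map onto plain SQLite operations. A minimal sketch using Python's built-in sqlite3 module; the table and columns here are hypothetical stand-ins, not drawn from the Spider databases:

```python
import sqlite3

# In-memory database standing in for a Spider database; the table and
# columns are hypothetical examples, not from the actual benchmark.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE city (Name TEXT, Population INTEGER)")
con.executemany("INSERT INTO city VALUES (?, ?)",
                [("Albany", 99000), ("Troy", 51000)])

# DESCRIBE: reveal column names and types for one table
describe = con.execute("PRAGMA table_info(city)").fetchall()
columns = [(row[1], row[2]) for row in describe]   # [(name, type), ...]

# SAMPLE: preview a few rows to understand the data
sample = con.execute("SELECT * FROM city LIMIT 3").fetchall()

# QUERY: an arbitrary read-only SQL query toward the answer
answer = con.execute(
    "SELECT Name FROM city ORDER BY Population DESC LIMIT 1"
).fetchone()
```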
 
164
 
165
  <h2>Built on OpenEnv</h2>
166
 
167
+ <p><a href="https://github.com/meta-pytorch/OpenEnv">OpenEnv</a> provides a Gymnasium-style interface for agentic RL environments. The contract is simple:</p>
168
 
169
  <ul>
170
  <li><code>reset(seed)</code> starts a new episode and returns the initial observation</li>
171
  <li><code>step(action)</code> executes one action and returns observation, reward, and done flag</li>
172
  </ul>
173
 
174
+ <p>Pydantic models enforce typed contracts between agent and environment. We tested integration with TRL's <code>environment_factory</code> for GRPO training. OpenEnv is also designed to work with torchforge and Unsloth, though we have not tested those integrations. SQLEnv implements the interface with four actions (DESCRIBE, SAMPLE, QUERY, ANSWER):</p>
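The contract can be sketched without any OpenEnv dependency. The class and field names below are illustrative stand-ins, and plain dataclasses replace the Pydantic models to keep the sketch self-contained:

```python
from dataclasses import dataclass

# Illustrative stand-ins for SQLEnv's typed models (the real environment
# uses Pydantic; dataclasses keep this sketch dependency-free).
@dataclass
class SQLAction:
    action_type: str        # "describe" | "sample" | "query" | "answer"
    argument: str = ""      # table name, SQL text, or the final answer

@dataclass
class Observation:
    text: str
    reward: float = 0.0
    done: bool = False

class ToySQLEnv:
    """Minimal reset(seed)/step(action) contract in the OpenEnv style."""
    MAX_STEPS = 15

    def reset(self, seed: int = 0) -> Observation:
        self._steps = 0
        return Observation(text="Tables: city, country")

    def step(self, action: SQLAction) -> Observation:
        self._steps += 1
        done = action.action_type == "answer" or self._steps >= self.MAX_STEPS
        return Observation(text=f"executed {action.action_type}", done=done)

env = ToySQLEnv()
obs = env.reset(seed=42)
while not obs.done:
    obs = env.step(SQLAction(action_type="answer", argument="Albany"))
```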
175
 
176
  <div class="legend">
177
  <div class="legend-item"><span class="legend-swatch" style="background:#d2a8ff;"></span> method</div>
 
223
 
224
  <p>Terminal correctness dominates. Maximum exploration reward across 15 steps is ~0.3, while a correct answer adds 1.0. An agent that explores but never answers always scores below one that answers correctly. Prior work on tool-using agents suggests that dense intermediate rewards are important for training small models (see TIPS, ToolRL, StepTool below). We did not ablate this by testing terminal-only reward at 0.6B parameters, but the progressive reward signal let us verify that the agent was learning the right strategic patterns: reward climbed from -0.1 to 0.5-0.7 as the agent shifted from random tool calls to describe-then-query-then-answer sequences.</p>
225
 
226
+ <p>The L2 progress component is inspired by potential-based reward shaping (<a href="https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf">Ng et al., 1999</a>). The raw delta follows the potential-difference form (φ(s′) − φ(s)), but the combined reward includes binning and per-step clipping for training stability, which departs from the strict theoretical guarantee. We confirmed empirically that the potential-difference structure matters: cumulative progress caps (not potential-based) caused the agent to explore endlessly and never answer.</p>
227
+
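The potential-difference form matters because the shaping terms telescope across an episode. A sketch with hypothetical potential values:

```python
# Potential-based shaping (Ng et al., 1999): the shaped reward adds
# gamma * phi(s') - phi(s) to the base reward at each transition.
GAMMA = 1.0  # episodic and undiscounted in this sketch

def shaped(base: float, phi_s: float, phi_next: float,
           gamma: float = GAMMA) -> float:
    return base + gamma * phi_next - phi_s

# Hypothetical potentials along one 3-step episode; the base reward is
# correctness only, paid at the terminal step.
phi = [0.0, 0.1, 0.25, 0.3]
base = [0.0, 0.0, 1.0]

total = sum(shaped(b, p0, p1)
            for b, p0, p1 in zip(base, phi[:-1], phi[1:]))
# The shaping terms telescope: total == sum(base) + phi[-1] - phi[0],
# so no policy can farm progress reward by looping through the same
# states; only reaching higher-potential states (and answering) pays.
```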
228
+ <p>Note that the L2 progress reward uses the gold answer (executed gold SQL rows) as a training-time verifier signal. This supervision is computed server-side and never visible to the agent, but it means the dense shaping relies on ground truth that is not available at inference.</p>
229
+
230
+ <p>Recent work supports dense shaping for tool-using agents. <a href="https://arxiv.org/abs/2603.22293">TIPS</a> (2026) reported 11.8% EM gains over PPO baselines in search-augmented multi-turn QA. <a href="https://arxiv.org/html/2504.13958v1">ToolRL</a> (2025) found 17% improvement over base models and 15% over SFT models through principled reward design for tool learning. <a href="https://arxiv.org/abs/2410.07745">StepTool</a> (2024) found step-grained shaping outperformed outcome-only rewards in tool-use benchmarks.</p>
231
 
232
  <h2>Training</h2>
233
 
234
+ <p>We train <a href="https://arxiv.org/abs/2505.09388">Qwen3-0.6B</a> with Group Relative Policy Optimization (<a href="https://arxiv.org/abs/2402.03300">GRPO</a>, from DeepSeekMath). TRL's <code>environment_factory</code> runs the agent through SQLEnv for each rollout, comparing multiple rollouts per question to compute advantages.</p>
235
 
236
  <p>SFT warmup proved critical. Per-turn SFT (347 examples, each one assistant turn) taught the model to call describe forever. Half the training examples were describe calls, so the model learned "when asked a question, call describe." When we applied a KL penalty during RL, every rollout stayed identical to this SFT behavior. The advantage between rollouts was zero, so no policy gradient could form.</p>
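The zero-advantage failure is easy to see in the group-relative computation at the heart of GRPO. A simplified sketch; TRL's actual implementation differs in details:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    # GRPO compares rollouts sampled for the same question: each rollout's
    # advantage is its reward standardized within the group.
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0.0:            # all rollouts identical -> no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]

# Diverse rollouts produce a usable gradient signal...
diverse = group_advantages([1.0, 0.2, 0.2, -0.1])
# ...but rollouts collapsed onto identical SFT behavior do not.
collapsed = group_advantages([0.3, 0.3, 0.3, 0.3])  # [0.0, 0.0, 0.0, 0.0]
```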
237
 
 
245
 
246
  <div style="text-align: center; margin: 24px 0;">
247
  <img src="rl-training-phase-1.png" alt="GRPO Training Progress" style="max-width: 100%; border-radius: 8px;">
248
+ <p class="caption" style="text-align: center;">Batch-mean reward across Phase 1 (easy, beta=0.04) and Phase 2 (easy+medium, beta=0). Reward starts negative and climbs to 0.5-0.7 as the agent learns describe-then-query-then-answer. Phase 2 holds at ~0.5. Individual episodes that solve the question correctly receive up to 1.15 total reward, but the batch mean is lower because most rollouts within a batch include incorrect attempts. 902 steps, ~4.75h on Colab L4.</p>
249
  </div>
250
 
251
  <p>SFT warmup takes ~1 minute (60 steps, loss drops from 1.6 to 0.08 in 2 epochs). GRPO Phase 1 runs ~2.25h, Phase 2 ~2.5h. Total pipeline: ~5 hours on a single Colab L4 (24GB VRAM), in one notebook session.</p>
 
318
  <tr class="row-base"><td>Zero-shot</td><td>0%</td><td>24-28%</td><td>10.8-12.4</td></tr>
319
  <tr class="row-base"><td>1-shot</td><td>0-2%</td><td>16-17%</td><td>14.0-14.8</td></tr>
320
  <tr class="row-base"><td>3-shot</td><td>0%</td><td>19-20%</td><td>13.8-14.8</td></tr>
321
+ <tr class="row-grpo"><td>GRPO v1 (2 epochs)</td><td>28-30%</td><td>95-100%</td><td>3.5-4.0</td></tr>
322
+ <tr class="row-grpo"><td>GRPO v2 (4 epochs)</td><td>24-32%</td><td>87-95%</td><td>3.5-4.0</td></tr>
323
  </table>
324
 
325
+ <p><strong>Parse rate</strong>: the trained model (v1) produces valid tool-call JSON 95-100% of the time. The base model fails 76-83% of the time and burns its step budget repeating malformed output. <strong>Accuracy</strong>: the base model cannot answer a single question even with 3 examples, but the trained model solves 14-15 out of 50 on this curated Spider subset.</p>
326
 
327
+ <p>We trained two GRPO checkpoints: v1 (2 epochs) and v2 (4 epochs). Across two evaluation runs each, v1 scored 28-30% accuracy with 95-100% parse rate, while v2 scored 24-32% with 87-95% parse rate. The run-to-run variation (6-8 percentage points) at N=50 makes checkpoint-to-checkpoint differences hard to interpret. Extended training also introduced an abstention pattern: v2 sometimes outputs "Task complete" instead of calling answer() on uncertain questions, which increases parse failures but may reflect learned caution. On this subset, additional RL training did not improve accuracy, which indicates that the bottleneck is the model's 0.6B pretraining rather than the training budget.</p>
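The 6-8 point swing is roughly what sampling noise alone predicts at this evaluation size. A quick back-of-envelope check:

```python
import math

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
# At ~30% accuracy over N=50 questions, one standard error is ~6.5
# percentage points, so run-to-run swings of 6-8 points are consistent
# with sampling noise rather than real checkpoint differences.
p, n = 0.30, 50
se = math.sqrt(p * (1 - p) / n)
```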
328
 
329
  <h2>Limitations at 0.6B Parameters</h2>
330
 
331
+ <p>On this curated 10-database Spider subset, three failure modes define the current ceiling:</p>
332
 
333
  <ul>
334
  <li><strong>Column name hallucination.</strong> The model reads <code>FullName</code> from DESCRIBE but writes <code>full_name</code> in SQL, or reads <code>Horsepower</code> and writes <code>HorsepowerDESC</code> (missing space). Pretraining biases override the schema that the model just observed in context.</li>
 
336
  <li><strong>More RL does not help.</strong> Extended training (v2: 4 total epochs) produced identical accuracy. This indicates the ceiling comes from pretraining knowledge rather than training budget.</li>
337
  </ul>
338
 
339
+ <p>On this subset, RL drives accuracy from 0% to ~30% but saturates at 0.6B capacity. The eval set is easy-heavy (91% single/two-table questions, no hard questions) and uses N=50 episodes per condition, so these results should not be generalized beyond this setup. Train and eval use mostly separate databases, with one schema (flight_2) appearing in both. We did not explore whether SFT on multi-table reasoning or structured thinking before JOINs could push past this ceiling in our current work. We discuss possible directions in the Next Steps section.</p>
340
+
341
+ <p>This evaluation is not comparable to the official Spider leaderboard, which uses different scoring (test-suite accuracy), full-schema input, and a broader database set.</p>
342
 
343
  <h2>The Learning Space</h2>
344
 
 
359
  <li><strong>676 questions</strong> (473 train / 203 eval) across 10 Spider databases with difficulty labels</li>
360
  <li><strong>Typed models</strong> with Pydantic: every action, observation, and state is explicit and debuggable</li>
361
  <li><strong>Read-only SQL</strong> via SQLite <code>mode=ro</code>, where the database engine enforces safety rather than regex</li>
362
+ <li><strong>Reward shaping</strong> inspired by potential-based methods (Ng et al., 1999), with practical modifications for training stability</li>
363
  <li><strong>TRL environment_factory</strong> integration for standard GRPO training without a custom loop</li>
364
  <li><strong>Green Agent evaluator</strong> with <code>Policy</code> protocol, <code>evaluate()</code> harness, and <code>RandomPolicy</code>/<code>OraclePolicy</code> baselines</li>
365
  </ul>
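The read-only enforcement is a one-line SQLite feature, and the same engine produces the transparent error messages discussed below. A sketch with a hypothetical throwaway table:

```python
import os
import sqlite3
import tempfile

# Build a throwaway database, then reopen it with the mode=ro URI flag
# so the engine itself, not a regex, rejects writes.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
rw = sqlite3.connect(path)
with rw:
    rw.execute("CREATE TABLE city (Name TEXT)")
    rw.execute("INSERT INTO city VALUES ('Albany')")
rw.close()

ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
try:
    ro.execute("INSERT INTO city VALUES ('Troy')")   # rejected by the engine
except sqlite3.OperationalError as e:
    write_error = str(e)   # attempt to write a readonly database

# Mistyped identifiers surface as transparent errors the agent can react to.
try:
    ro.execute("SELECT full_name FROM city")
except sqlite3.OperationalError as e:
    column_error = str(e)  # no such column: full_name
```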
 
387
 
388
  <p><strong>Transparent errors help the agent learn.</strong> When the environment returns <code>"Error: no such column: full_name"</code> instead of empty results, the agent can develop error-recovery strategies. Informative error messages give the RL training signal something to work with.</p>
389
 
390
+ <p><strong>Dense rewards benefit from theoretical grounding.</strong> Potential-based shaping (Ng et al., 1999) provides the theoretical foundation for our reward design, though our implementation includes practical modifications (binning, clipping) that depart from the strict potential-difference form. Without some form of dense shaping, we observed agents accumulating exploration rewards instead of answering questions. Recent work supports this direction: TIPS (2026) reported gains over PPO baselines in multi-turn QA, ToolRL (2025) found improvements through principled reward decomposition, and StepTool (2024) found step-grained shaping outperformed outcome-only rewards. These results suggest that principled reward design is important for multi-turn environments.</p>
391
 
392
  <p><strong>The environment is the contribution.</strong> The action space, reward function, and episode structure do not depend on the choice of model or RL algorithm. SQLEnv targets small models that need to learn database exploration through training, since larger models can often handle these tasks with few-shot prompting alone. As newer small language models become available, the environment provides a training ground for teaching them iterative reasoning.</p>
393
 
docs/blog-post-v1.md CHANGED
@@ -32,11 +32,11 @@ ANSWER "Louis Deacon"</pre>
32
  </div>
33
  </div>
34
 
35
- Both agents have the same four tools, the same 15-step budget, and the same databases. The untrained agent wastes most of its steps without making progress. The trained agent first explores the schema, then hits an error, adapts, and solves a harder question in a third of the steps.
36
 
37
  ## The Gap
38
 
39
- Standard text-to-SQL evaluation works like this: hand the model a complete schema (all tables, all columns, all relationships) and ask it to produce one SQL query. If the query matches the gold answer, it passes. This setup rewards memorization. The model never learns to explore a schema or iterate toward a solution, so it struggles on unfamiliar databases with many tables where the full schema cannot fit in context.
40
 
41
  SQLEnv takes a different approach. The agent progressively discovers the schema through its own actions: it starts with only table names and must call DESCRIBE, SAMPLE, and QUERY to reveal columns, types, and relationships within a fixed step budget. This is a POMDP (partially observable Markov decision process) where the agent acts under uncertainty, which makes exploration necessary and learnable.
42
 
@@ -47,7 +47,7 @@ Consider the situation where you need to answer a question using data in an unfa
47
  SQLEnv captures this workflow. Four actions mirror what analysts do:
48
 
49
  - **DESCRIBE** reveals column names and types for a table
50
- - **SAMPLE** previews rows to understand the data
51
  - **QUERY** executes a read-only SQL query
52
  - **ANSWER** submits a final answer
53
 
@@ -81,12 +81,12 @@ Total reward: 1.180. Four steps. Exploration: 0.180, terminal: 1.000.
81
 
82
  ## Built on OpenEnv
83
 
84
- [OpenEnv](https://github.com/meta-pytorch/OpenEnv) is a standard protocol for RL environments. The contract is simple:
85
 
86
  - `reset(seed)` starts a new episode and returns the initial observation
87
  - `step(action)` executes one action and returns observation, reward, and done flag
88
 
89
- Pydantic models enforce typed contracts between agent and environment. Any environment that implements this protocol plugs into TRL, torchforge, and Unsloth without glue code.
90
 
91
  SQLEnv implements this protocol with four domain-specific actions:
92
 
@@ -121,11 +121,15 @@ Three layers of reward signal:
121
 
122
  Terminal correctness dominates. Maximum exploration reward across 15 steps is ~0.3, while a correct answer adds 1.0. An agent that explores but never answers always scores below one that answers correctly. Prior work on tool-using agents suggests that dense intermediate rewards are important for training small models (see TIPS, ToolRL, StepTool below). We did not ablate this by testing terminal-only reward at 0.6B parameters, but the progressive reward signal let us verify that the agent was learning the right strategic patterns: reward climbed from -0.1 to 0.5-0.7 as the agent shifted from random tool calls to describe-then-query-then-answer sequences.
123
 
124
- The progress signal uses delta-from-previous-step, a form of potential-based reward shaping ([Ng et al., 1999](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf)). This preserves the optimal policy because the agent cannot game progress at the expense of correctness. We confirmed this empirically: cumulative progress caps (not potential-based) caused the agent to explore endlessly and never answer. Recent work validates this approach for tool-using agents. [TIPS](https://arxiv.org/abs/2603.22293) (2026) showed potential-based turn-level shaping improved multi-turn agents by 11.8% over GRPO baselines. [ToolRL](https://arxiv.org/html/2504.13958v1) (2025) found fine-grained reward decomposition improved tool learning by 17%. [StepTool](https://arxiv.org/abs/2410.07745) (2024) confirmed step-grained shaping outperformed outcome-only rewards.
 
 
 
 
125
 
126
  ## Training
127
 
128
- We train Qwen3-0.6B with Group Relative Policy Optimization (GRPO). TRL's `environment_factory` runs the agent through SQLEnv for each rollout, comparing multiple rollouts per question to compute advantages.
129
 
130
  SFT warmup proved critical. Per-turn SFT (347 examples, each one assistant turn) taught the model to call describe forever. Half the training examples were describe calls, so the model learned "when asked a question, call describe." When we applied a KL penalty during RL, every rollout stayed identical to this SFT behavior. The advantage between rollouts was zero, so no policy gradient could form.
131
 
@@ -136,7 +140,7 @@ Two-phase curriculum:
136
  - **Phase 2**: Easy + medium (multi-table JOINs), KL removed (beta=0) so the agent can deviate further from SFT and discover new strategies. Reward holds at ~0.5.
137
 
138
  ![GRPO Training Progress](rl-training-phase-1.png)
139
- *GRPO reward across Phase 1 (easy, beta=0.04) and Phase 2 (easy+medium, beta=0). Reward starts negative and climbs to 0.5-0.7 as the agent learns describe-then-query-then-answer. Phase 2 holds at ~0.5. Peaks at 1.15 mark correctly solved episodes. 902 steps, ~4.75h on Colab L4.*
140
 
141
  SFT warmup takes ~1 minute (60 steps, loss drops from 1.6 to 0.08 in 2 epochs). GRPO Phase 1 runs ~2.25h, Phase 2 ~2.5h. Total pipeline: ~5 hours on a single Colab L4 (24GB VRAM), in one notebook session.
142
 
@@ -206,21 +210,24 @@ All conditions run through SQLEnv's Green Agent evaluator: `evaluate(env, policy
206
  | Zero-shot | 0% | 24-28% | 10.8-12.4 |
207
  | 1-shot | 0-2% | 16-17% | 14.0-14.8 |
208
  | 3-shot | 0% | 19-20% | 13.8-14.8 |
209
- | GRPO (trained) | ~30% | 95-100% | 3.5-4.0 |
 
210
 
211
- Two results stand out. **95-100% parse rate**: the trained model almost always produces valid tool-call JSON. The base model fails 76-83% of the time and burns its step budget repeating malformed output. **~30% accuracy from 0%**: the base model cannot answer a single question even with 3 examples, but the trained model solves 12-16 out of 50 in 3-4 steps.
212
 
213
- We trained two GRPO checkpoints: v1 (2 epochs) and v2 (4 epochs). Both scored ~30% accuracy across two evaluation runs. The variation between runs (6-8 percentage points) was larger than the difference between checkpoints, indicating that more training does not raise the ceiling. For RL alone, this indicates that the bottleneck is the model's 0.6B pretraining rather than the training budget.
214
 
215
  ## Limitations at 0.6B Parameters
216
 
217
- Three failure modes define the current ceiling:
218
 
219
  - **Column name hallucination.** The model reads `FullName` from DESCRIBE but writes `full_name` in SQL, or reads `Horsepower` and writes `HorsepowerDESC` (missing space). Pretraining biases override the schema that the model just observed in context.
220
  - **FK chain reasoning.** The model handles single-table queries well but fails on three-table JOINs such as Documents → Templates → Ref_Template_Types. It cannot chain foreign keys through intermediate tables.
221
  - **More RL does not help.** Extended training (v2: 4 total epochs) produced identical accuracy. This indicates the ceiling comes from pretraining knowledge rather than training budget.
222
 
223
- RL drives accuracy from 0% to 30% but saturates at 0.6B capacity. We did not explore whether SFT on multi-table reasoning or structured thinking before JOINs could push past this ceiling in our current work. We discuss possible directions in the Next Steps section.
 
 
224
 
225
  ## The Learning Space
226
 
@@ -239,7 +246,7 @@ Random scores 0.247 by accumulating small exploration rewards across 15 steps wi
239
  - **676 questions** (473 train / 203 eval) across 10 Spider databases with difficulty labels
240
  - **Typed models** with Pydantic: every action, observation, and state is explicit and debuggable
241
  - **Read-only SQL** via SQLite `mode=ro`, where the database engine enforces safety rather than regex
242
- - **Potential-based reward shaping** (Ng et al., 1999) that provably preserves the optimal policy
243
  - **TRL environment_factory** integration for standard GRPO training without a custom loop
244
  - **Green Agent evaluator** with `Policy` protocol, `evaluate()` harness, and `RandomPolicy`/`OraclePolicy` baselines
245
 
@@ -264,6 +271,6 @@ The environment supports two directions for improvement:

  **Transparent errors help the agent learn.** When the environment returns `"Error: no such column: full_name"` instead of empty results, the agent can develop error-recovery strategies. Informative error messages give the RL training signal something to work with.

- **Dense rewards benefit from theoretical grounding.** Potential-based shaping (Ng et al., 1999) guarantees the agent optimizes for the right objective. Without it, we observed agents accumulating exploration rewards instead of answering questions. Recent work supports this for tool-using agents. TIPS (2026) showed 11.8% gains from potential-based turn-level shaping. ToolRL (2025) found 17% improvement from fine-grained reward decomposition. StepTool (2024) confirmed step-grained shaping outperformed outcome-only rewards. These results suggest that principled reward design is important for multi-turn environments.

  **The environment is the contribution.** The action space, reward function, and episode structure do not depend on the choice of model or RL algorithm. SQLEnv targets small models that need to learn database exploration through training, since larger models can often handle these tasks with few-shot prompting alone. As newer small language models become available, the environment provides a training ground for teaching them iterative reasoning.
 
  </div>
  </div>

+ Both agents have the same four tools, the same 15-step budget, and the same databases. Different questions are shown to illustrate the range of behaviors; both agents were evaluated on the same 50-question set in the quantitative comparison below. The untrained agent wastes most of its steps without making progress. The trained agent first explores the schema, then hits an error, adapts, and solves a harder question in a third of the steps.

  ## The Gap

+ Standard Spider-style text-to-SQL ([Yu et al., 2018](https://yale-lily.github.io/spider)) gives the model the full schema up front and scores the resulting SQL with exact-match, execution, or test-suite accuracy. This setup rewards memorization. The model never learns to explore a schema or iterate toward a solution, so it struggles on unfamiliar databases with many tables where the full schema cannot fit in context.

  SQLEnv takes a different approach. The agent progressively discovers the schema through its own actions: it starts with only table names and must call DESCRIBE, SAMPLE, and QUERY to reveal columns, types, and relationships within a fixed step budget. This is a POMDP (partially observable Markov decision process) where the agent acts under uncertainty, which makes exploration necessary and learnable.

 
  SQLEnv captures this workflow. Four actions mirror what analysts do:

  - **DESCRIBE** reveals column names and types for a table
+ - **SAMPLE** previews rows to understand the data (available but rarely used by the trained agent, which learned to rely on DESCRIBE and QUERY)
  - **QUERY** executes a read-only SQL query
  - **ANSWER** submits a final answer
 
 

  ## Built on OpenEnv

+ [OpenEnv](https://github.com/meta-pytorch/OpenEnv) provides a Gymnasium-style interface for agentic RL environments. The contract is simple:

  - `reset(seed)` starts a new episode and returns the initial observation
  - `step(action)` executes one action and returns observation, reward, and done flag

+ Pydantic models enforce typed contracts between agent and environment. We tested integration with TRL's `environment_factory` for GRPO training. OpenEnv is also designed to work with torchforge and Unsloth, though we have not tested those integrations.
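
The two-method contract can be sketched as a minimal protocol. This is an illustrative sketch, not OpenEnv's exact types: the `StepResult` fields and the toy `CounterEnv` are assumptions for demonstration only.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StepResult:
    """Illustrative result bundle; OpenEnv's real types differ."""
    observation: Any
    reward: float
    done: bool


class Env:
    """Gymnasium-style contract: reset() once, then step() until done."""

    def reset(self, seed: Optional[int] = None) -> StepResult:
        raise NotImplementedError

    def step(self, action: Any) -> StepResult:
        raise NotImplementedError


class CounterEnv(Env):
    """Toy environment: reward 1.0 if the agent answers within a 3-step budget."""

    def reset(self, seed: Optional[int] = None) -> StepResult:
        self.steps = 0
        return StepResult(observation="table names only", reward=0.0, done=False)

    def step(self, action: Any) -> StepResult:
        self.steps += 1
        answered = action == "ANSWER"
        out_of_budget = self.steps >= 3
        return StepResult(
            observation=f"step {self.steps}",
            reward=1.0 if answered else 0.0,
            done=answered or out_of_budget,
        )


env = CounterEnv()
r = env.reset()
while not r.done:
    r = env.step("ANSWER")  # trivial policy that answers immediately
print(r.reward, r.done)  # → 1.0 True
```

The same reset/step loop is all an RL trainer needs to drive rollouts, which is what makes the interface easy to plug into GRPO.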

  SQLEnv implements this protocol with four domain-specific actions:
 
 

  Terminal correctness dominates. Maximum exploration reward across 15 steps is ~0.3, while a correct answer adds 1.0. An agent that explores but never answers always scores below one that answers correctly. Prior work on tool-using agents suggests that dense intermediate rewards are important for training small models (see TIPS, ToolRL, StepTool below). We did not ablate this by testing terminal-only reward at 0.6B parameters, but the progressive reward signal let us verify that the agent was learning the right strategic patterns: reward climbed from -0.1 to 0.5-0.7 as the agent shifted from random tool calls to describe-then-query-then-answer sequences.

+ The L2 progress component is inspired by potential-based reward shaping ([Ng et al., 1999](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf)). The raw delta follows the potential-difference form (phi(s') - phi(s)), but the combined reward includes binning and per-step clipping for training stability, which departs from the strict theoretical guarantee. We confirmed empirically that the potential-difference structure matters: cumulative progress caps (not potential-based) caused the agent to explore endlessly and never answer.
+
+ Note that the L2 progress reward uses the gold answer (executed gold SQL rows) as a training-time verifier signal. This supervision is computed server-side and never visible to the agent, but it means the dense shaping relies on ground truth that is not available at inference.
+
+ Recent work supports dense shaping for tool-using agents. [TIPS](https://arxiv.org/abs/2603.22293) (2026) reported 11.8% EM gains over PPO baselines in search-augmented multi-turn QA. [ToolRL](https://arxiv.org/html/2504.13958v1) (2025) found 17% improvement over base models and 15% over SFT models through principled reward design for tool learning. [StepTool](https://arxiv.org/abs/2410.07745) (2024) found step-grained shaping outperformed outcome-only rewards in tool-use benchmarks.
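
A minimal sketch of the shaped progress delta. The potential here (fraction of gold-relevant schema items the agent has discovered) and the clip bound are assumptions for illustration; SQLEnv's actual binning and constants differ.

```python
GAMMA = 1.0   # undiscounted episodes (assumed)
CLIP = 0.1    # assumed per-step cap on the shaping term


def phi(state: dict) -> float:
    """Hypothetical potential: fraction of gold-relevant items discovered."""
    needed = state["gold_items"]
    return len(state["discovered"] & needed) / max(len(needed), 1)


def progress_delta(prev: dict, nxt: dict) -> float:
    # Potential-difference form from Ng et al. (1999); the optimal-policy
    # guarantee holds only without the clip, which is added for stability.
    raw = GAMMA * phi(nxt) - phi(prev)
    return max(-CLIP, min(CLIP, raw))


before = {"gold_items": {"singer.name", "singer.age"}, "discovered": set()}
after = {"gold_items": {"singer.name", "singer.age"},
         "discovered": {"singer.name"}}  # one DESCRIBE revealed a needed column
print(progress_delta(before, after))  # → 0.1 (raw 0.5, clipped)
```

Because the reward is a difference of potentials, an agent that loops over the same DESCRIBE calls earns zero: the potential stops changing, which is exactly why cumulative (non-potential) bonuses encouraged endless exploration instead.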
  ## Training

+ We train [Qwen3-0.6B](https://arxiv.org/abs/2505.09388) with Group Relative Policy Optimization ([GRPO](https://arxiv.org/abs/2402.03300), from DeepSeekMath). TRL's `environment_factory` runs the agent through SQLEnv for each rollout, comparing multiple rollouts per question to compute advantages.

  SFT warmup proved critical. Per-turn SFT (347 examples, each one assistant turn) taught the model to call describe forever. Half the training examples were describe calls, so the model learned "when asked a question, call describe." When we applied a KL penalty during RL, every rollout stayed identical to this SFT behavior. The advantage between rollouts was zero, so no policy gradient could form.
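
The zero-gradient failure follows directly from GRPO's group-relative advantage, sketched below (standard normalization from the GRPO paper; the group size and reward values are made up for illustration):

```python
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize episode rewards within one question's rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Diverse rollouts: the one correct answer gets a large positive advantage.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))

# KL-pinned rollouts that all mimic SFT identically: every advantage is 0,
# so no policy gradient can form -- the failure described above.
print(group_advantages([0.5, 0.5, 0.5, 0.5]))  # → [0.0, 0.0, 0.0, 0.0]
```

This is why removing the KL penalty (or diversifying the SFT data) was necessary: advantages only exist when rollouts within a group differ.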


  - **Phase 2**: Easy + medium (multi-table JOINs), KL removed (beta=0) so the agent can deviate further from SFT and discover new strategies. Reward holds at ~0.5.

  ![GRPO Training Progress](rl-training-phase-1.png)
+ *Batch-mean reward across Phase 1 (easy, beta=0.04) and Phase 2 (easy+medium, beta=0). Reward starts negative and climbs to 0.5-0.7 as the agent learns describe-then-query-then-answer. Phase 2 holds at ~0.5. Individual episodes that solve the question correctly receive up to 1.15 total reward, but the batch mean is lower because most rollouts within a batch include incorrect attempts. 902 steps, ~4.75h on Colab L4.*

  SFT warmup takes ~1 minute (60 steps, loss drops from 1.6 to 0.08 in 2 epochs). GRPO Phase 1 runs ~2.25h, Phase 2 ~2.5h. Total pipeline: ~5 hours on a single Colab L4 (24GB VRAM), in one notebook session.

  | Zero-shot | 0% | 24-28% | 10.8-12.4 |
  | 1-shot | 0-2% | 16-17% | 14.0-14.8 |
  | 3-shot | 0% | 19-20% | 13.8-14.8 |
+ | GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
+ | GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |

+ Two results stand out. **Parse rate**: the trained model (v1) produces valid tool-call JSON 95-100% of the time. The base model fails 76-83% of the time and burns its step budget repeating malformed output. **Accuracy**: the base model cannot answer a single question even with 3 examples, but the trained model solves 14-15 out of 50 on this curated Spider subset.

+ We trained two GRPO checkpoints: v1 (2 epochs) and v2 (4 epochs). Across two evaluation runs each, v1 scored 28-30% accuracy with 95-100% parse rate, while v2 scored 24-32% with 87-95% parse rate. The run-to-run variation (6-8 percentage points) at N=50 makes checkpoint-to-checkpoint differences hard to interpret. Extended training also introduced an abstention pattern: v2 sometimes outputs "Task complete" instead of calling answer() on uncertain questions, which increases parse failures but may reflect learned caution. On this subset, additional RL training did not improve accuracy, which indicates that the bottleneck is the model's 0.6B pretraining rather than the training budget.

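
The 6-8 point spread is consistent with plain binomial noise at this sample size; a quick check (p = 0.30 is the observed accuracy, N = 50 the episodes per condition, both from the evaluation above):

```python
import math

# One-standard-error band for an accuracy estimate from N = 50 episodes.
p, n = 0.30, 50
se = math.sqrt(p * (1 - p) / n)
print(f"standard error ≈ {se * 100:.1f} percentage points")  # ≈ 6.5
```

A single standard error of ~6.5 points already spans the v1 vs. v2 gap, which is why the checkpoint comparison is inconclusive at this evaluation size.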
  ## Limitations at 0.6B Parameters

+ On this curated 10-database Spider subset, three failure modes define the current ceiling:

  - **Column name hallucination.** The model reads `FullName` from DESCRIBE but writes `full_name` in SQL, or reads `Horsepower` and writes `HorsepowerDESC` (missing space). Pretraining biases override the schema that the model just observed in context.
  - **FK chain reasoning.** The model handles single-table queries well but fails on three-table JOINs such as Documents → Templates → Ref_Template_Types. It cannot chain foreign keys through intermediate tables.
  - **More RL does not help.** Extended training (v2: 4 total epochs) did not improve accuracy. This indicates the ceiling comes from pretraining knowledge rather than training budget.

+ On this subset, RL drives accuracy from 0% to ~30% but saturates at 0.6B capacity. The eval set is easy-heavy (91% single/two-table questions, no hard questions) and uses N=50 episodes per condition, so these results should not be generalized beyond this setup. Train and eval use mostly separate databases, with one schema (flight_2) appearing in both. We did not explore whether SFT on multi-table reasoning or structured thinking before JOINs could push past this ceiling in our current work. We discuss possible directions in the Next Steps section.
+
+ This evaluation is not comparable to the official Spider leaderboard, which uses different scoring (test-suite accuracy), full-schema input, and a broader database set.
 
  ## The Learning Space
 
 
  - **676 questions** (473 train / 203 eval) across 10 Spider databases with difficulty labels
  - **Typed models** with Pydantic: every action, observation, and state is explicit and debuggable
  - **Read-only SQL** via SQLite `mode=ro`, where the database engine enforces safety rather than regex
+ - **Reward shaping** inspired by potential-based methods (Ng et al., 1999), with practical modifications for training stability
  - **TRL environment_factory** integration for standard GRPO training without a custom loop
  - **Green Agent evaluator** with `Policy` protocol, `evaluate()` harness, and `RandomPolicy`/`OraclePolicy` baselines
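
The read-only guarantee can be exercised directly. This sketch uses a throwaway database (the path and table are made up) to show the engine itself, not a regex filter, rejecting writes:

```python
import os
import sqlite3
import tempfile

# Build a throwaway database so the read-only open has a target.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
with sqlite3.connect(path) as conn:
    conn.execute("CREATE TABLE singer (name TEXT)")
    conn.execute("INSERT INTO singer VALUES ('Joe')")

# mode=ro via a URI connection: SQLite enforces read-only access.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT name FROM singer").fetchall())  # → [('Joe',)]

try:
    ro.execute("DELETE FROM singer")
except sqlite3.OperationalError as err:
    print("write rejected:", err)  # "attempt to write a readonly database"
```

Because the rejection happens inside the engine, no SQL the agent emits (however obfuscated) can mutate the database, which keeps the QUERY action safe without parsing.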
 
 

  **Transparent errors help the agent learn.** When the environment returns `"Error: no such column: full_name"` instead of empty results, the agent can develop error-recovery strategies. Informative error messages give the RL training signal something to work with.

+ **Dense rewards benefit from theoretical grounding.** Potential-based shaping (Ng et al., 1999) provides the theoretical foundation for our reward design, though our implementation includes practical modifications (binning, clipping) that depart from the strict potential-difference form. Without some form of dense shaping, we observed agents accumulating exploration rewards instead of answering questions. Recent work supports this direction: TIPS (2026) reported gains over PPO baselines in multi-turn QA, ToolRL (2025) found improvements through principled reward decomposition, and StepTool (2024) found step-grained shaping outperformed outcome-only rewards. These results suggest that principled reward design is important for multi-turn environments.

  **The environment is the contribution.** The action space, reward function, and episode structure do not depend on the choice of model or RL algorithm. SQLEnv targets small models that need to learn database exploration through training, since larger models can often handle these tasks with few-shot prompting alone. As newer small language models become available, the environment provides a training ground for teaching them iterative reasoning.
scratch_hf_smoke.py ADDED
@@ -0,0 +1,20 @@
+ from sql_env.client import SQLEnvClient
+ from sql_env.models import SQLAction
+
+ URL = "https://hjerpe-sql-env.hf.space"
+
+ with SQLEnvClient(base_url=URL) as env:
+     r = env.reset()
+     print("RESET observation:", r.observation)
+     print()
+
+     # Pick a real table name from r.observation.schema_info before running
+     # the next two steps; edit this script after reset prints something.
+     r = env.step(SQLAction(action_type="DESCRIBE", argument="<TABLE>"))
+     print("DESCRIBE:", r.observation)
+
+     r = env.step(SQLAction(action_type="QUERY", argument="SELECT 1"))
+     print("QUERY:", r.observation)
+
+     r = env.step(SQLAction(action_type="ANSWER", argument="test"))
+     print("ANSWER reward=", r.reward, "done=", r.done)