Blog Material — Raw Knowledge Dump
Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at docs/blog-outline.md, the draft at docs/blog-post.md.
Training Run Summary
Run progression (what each run taught us)
- Run 1: SFT works, GRPO plateaus — no penalty for post-episode waste
- Run 2: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
- Run 3: Without KL penalty, GRPO drifts structural tokens (`<tool_response>` instead of `<tool_call>`)
- Run 4: KL penalty + reference model = OOM on L4
- Run 5: KL too conservative with single-turn SFT — model only calls describe, never queries
- Run 6: Multi-turn SFT breakthrough — first successful training, reward -0.1→0.7
- Run 7: Repeat penalty, stable training, multi-table weakness exposed
- Run 8: Thinking mode helps error recovery, introduces `<think>assistant` degenerate loop, OOM crash
- Run 9: v2 continued training confirms ceiling — more epochs don't help medium questions
Key numbers
| Metric | Value |
|---|---|
| Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
| Hardware | Colab L4 (24GB VRAM) |
| SFT data | 120 multi-turn trajectories, assistant_only_loss |
| Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
| Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
| Total training | ~7.3h on L4 for both phases |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
| Parse rate | >98% (valid tool-call JSON) |
| Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
| Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
| Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |
Concrete Episode Examples (for blog)
Success: Simple aggregation (3 tool calls)
Question: "What is the total bonus given in all evaluations?"
- `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
- `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
- `answer("19500.0")` → correct

Reward: 1.15
Success: Error recovery with JOIN
Question: "Which employee received the biggest bonus?"
- `describe("employee")` → Employee_ID, Name, Age, City
- `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
- `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
- `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
- `answer("Louis Deacon")` → correct

Reward: 1.13
Success: Subquery with comparison
Question: "How many cars have greater accelerate than the one with the most horsepower?"
- `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
- `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
- `answer("39")` → correct

Reward: 1.15
Failure: Column name hallucination
Question: "What are all the makers and models?"
- `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
- `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
- `describe("model_list")` → ModelId, Maker, Model
- `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
- `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)

Model found the right data but answer was truncated by the 20-row display limit.
Failure: FK chain reasoning
Question: "Which city has most number of arriving flights?"
- `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
- `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
- `answer("Albany")` → incorrect

Model counted airports per city, not flights per city. Needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.
Failure: Invented table
Question: "What are all details of students who registered but did not attend any course?"
- `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
- `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
- `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance

Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.
Reward Architecture Details
Three-layer structure
L1 Operational (every step):
+0.02 exec_ok
+0.01 new_info (unique SQL hash)
-0.03 repeat penalty
-0.02 step cost
L2 Progress (QUERY only):
Delta from previous binned progress × 0.15
Binned to {0, 0.25, 0.5, 0.75, 1.0}
L3 Terminal (ANSWER only):
+1.0 correct, 0.0 wrong
Per-step clip: [-0.10, 0.15]
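A minimal Python sketch of how the three layers could compose. The constants are the ones listed above; the function name, signature, and the exact interplay are assumptions, not the real `server/reward.py` (observed per-step values like +0.015 don't decompose exactly from these constants, and whether the terminal +1.0 sits outside the per-step clip is inferred from the observed +1.0 terminal step rewards).

```python
# Hedged sketch of the three-layer reward; NOT the actual server/reward.py.
# Constants are from this section; composition details are assumptions.

def step_reward(action, exec_ok, sql_hash, seen_hashes,
                progress=0.0, prev_progress=0.0, answer_correct=False):
    # L1 operational (every step)
    r = -0.02                                  # step cost
    if exec_ok:
        r += 0.02                              # successful execution
    if sql_hash in seen_hashes:
        r -= 0.03                              # repeat penalty (duplicate SQL)
    else:
        r += 0.01                              # new info (unique SQL hash)

    # L2 progress delta (QUERY only), binned to {0, 0.25, 0.5, 0.75, 1.0}
    if action == "QUERY":
        def bin_(p):
            return round(p * 4) / 4
        r += (bin_(progress) - bin_(prev_progress)) * 0.15

    # Per-step clip on the shaping part (L1 + L2)
    r = max(-0.10, min(0.15, r))

    # L3 terminal (ANSWER only); assumed outside the clip, since
    # correct answers observably add the full +1.0
    if action == "ANSWER" and answer_correct:
        r += 1.0
    return r
```

The key design point preserved here: L2 rewards only the *change* in binned progress, so re-running the same query adds nothing, and the terminal reward dwarfs anything the shaping layers can accumulate.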
Why potential-based shaping matters
- Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves optimal policy
- Our delta progress IS potential-based with γ=1
- Cumulative caps are NOT potential-based (depend on trajectory history)
- Without this guarantee, agents learn to farm exploration rewards
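The policy-invariance point can be shown in a few lines: per-step shaping rewards of the form Phi(s') − Phi(s) telescope, so the total shaping collected depends only on the first and last potentials, never on the path taken. Toy numbers; the function is illustrative, not project code.

```python
# Why delta-of-potential shaping can't be farmed (Ng et al. 1999, gamma=1):
# the per-step terms phi(s') - phi(s) telescope to phi(s_T) - phi(s_0).

def shaping_total(potentials):
    """Sum of per-step shaping rewards along a trajectory of potentials."""
    return sum(b - a for a, b in zip(potentials, potentials[1:]))

direct = shaping_total([0.0, 0.5, 1.0])                             # efficient path
wandering = shaping_total([0.0, 0.25, 0.0, 0.25, 0.5, 0.25, 1.0])   # backtracking

assert direct == wandering == 1.0  # same total shaping either way
```

A cumulative cap breaks exactly this property: whether a step's reward is capped depends on what the agent collected earlier, so the total becomes path-dependent and farmable.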
Anti-farming mechanisms
- Hard budget (15 steps)
- Step cost (-0.02)
- Repeat penalty (-0.03)
- Terminal dominance (1.0 vs ~0.3 max exploration)
- Per-step clip [-0.10, 0.15]
- Post-episode penalty (-0.3)
Eval Results (N=50, 2026-04-11)
Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
|---|---|---|---|---|---|---|
| zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
| 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
| 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
| GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
| GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |
Previous run (Run A, same day, same seed)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
|---|---|---|---|---|---|
| zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
| 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
| 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
| GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
| GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |
Run-to-run variation (important for blog)
v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is within run-to-run variation (6-8pp swing). For the blog, report both as "28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.
The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.
Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).
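A sketch of that fallback, with hypothetical names (the real parser and action shapes live in the environment code): an unparseable output is mapped to a harmless no-op action instead of terminating the episode.

```python
# Hypothetical sketch of the parse-failure fallback described above.

def handle_model_output(raw_output, parse_tool_call):
    """Map model output to an action; never end the episode on a parse failure."""
    call = parse_tool_call(raw_output)   # assumed to return None on invalid output
    if call is None:
        # No-op DESCRIBE: episode continues, step budget still decrements,
        # so base models burn their budget instead of being cut short.
        return {"name": "describe", "arguments": {}, "parse_fail": True}
    return call
```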
Key observations from N=50 eval (with retry, 2 runs)
- ~30% accuracy for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
- Run-to-run variation is ~6-8pp — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
- Base model parse failure loop — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
- GRPO solves format compliance — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid `<tool_call>` JSON.
- GRPO failure mode is SQL quality, not format — episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
- Extra turns don't help base models — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
- 1-shot occasionally gets lucky — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.
v2 vs v1: similar accuracy, more parse failures — behavioral shift
Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:
- v1 guesses more: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
- v2 gives up on hard questions: v2 produces "Task complete." output after multiple failed queries instead of calling `answer()`, producing parse failures. v2 learned that some questions are unsolvable.
- Neither is clearly better: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.
The v2 parse failure pattern (from raw output):
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE Population < ...
[PARSE FAIL] raw: Task complete. ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete. ← repeats until budget
Compare v1 on the same type of question:
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE ...
[OK] ANSWER: European cities and their names are: 42 ← wrong, but at least calls answer()
This is a form of calibrated uncertainty — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.
For the blog narrative
The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.
Format Mismatch Discovery (F011)
The three differences between eval and training
- role:tool vs role:user — Qwen3 renders `role:"tool"` as `<|im_start|>user\n<tool_response>...</tool_response>`, and `role:"user"` as `<|im_start|>user\n` plain text. Same role token, different content structure.
- Structured tool_calls vs raw text — Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`; eval was using `{"role":"assistant", "content":"<tool_call>...</tool_call>"}`.
- No separator vs `\n\n` — TRL appends the `reset()` return directly to the user message. Eval had `question\n\ntable_hint`.
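For illustration, the second mismatch can be sketched as message dicts (shapes reconstructed from the differences above; field values are placeholders, not actual episode data):

```python
# Before (mismatched with training): raw <tool_call> text in assistant content
bad_turn = {
    "role": "assistant",
    "content": '<tool_call>{"name": "describe", "arguments": "{...}"}</tool_call>',
}

# After (matches training): structured tool_calls, plus role "tool" responses
good_turns = [
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "describe", "arguments": '{"table": "employee"}'}}]},
    {"role": "tool", "content": "Employee_ID, Name, Age, City"},
]
```

The chat template renders these two shapes into different token sequences, which is why the checkpoint scored 0% until eval used the structured form.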
Impact
Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint). After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.
Lesson
Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.
Multi-Turn SFT — Why It's Critical
Per-turn SFT (broken)
- 347 examples, each one assistant turn
- ~50% were describe calls
- Model learned: "when asked a question, call describe"
- With KL penalty, model stayed anchored to this single-action policy
- Result: reward=0.00, all rollouts identical, advantage=0
Multi-turn SFT (working)
- 120 examples, each a full describe→query→answer trajectory
- `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags)
- Model learned: the SEQUENCE describe→query→answer
- With KL penalty, model explores within the multi-turn strategy
- Result: reward climbs to 0.7 in Phase 1
Template patch detail
Qwen3's chat template lacks {% generation %} tags needed by TRL for assistant_only_loss. We patch the template before SFT, restore original before GRPO (TRL does exact-match checks on template string in add_response_schema() and get_training_chat_template()).
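The patch-then-restore pattern can be sketched as a context manager. The actual patched template body is omitted (it depends on Qwen3's template internals), and this helper is hypothetical, not the project's code; the point is that the original template string must be back in place before GRPO's exact-match checks run.

```python
from contextlib import contextmanager

@contextmanager
def patched_chat_template(tokenizer, patched_template):
    """Temporarily swap in a template with {% generation %} tags for SFT,
    then restore the original so TRL's exact string matching still passes."""
    original = tokenizer.chat_template
    tokenizer.chat_template = patched_template
    try:
        yield tokenizer
    finally:
        tokenizer.chat_template = original

# Usage sketch (names hypothetical):
# with patched_chat_template(tok, qwen3_template_with_generation_tags):
#     sft_trainer.train()
# GRPO runs afterwards against the untouched original template.
```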
The 0.6B Capacity Ceiling
What works at 0.6B
- Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
- Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
- WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
- Answer formatting: comma lists, pipe-delimited rows, `[]` for empty
- Error recovery: describe after SQL error, retry with correct column names
- `sample` tool usage (learned in Run 6, inconsistent later)
What breaks at 0.6B
- FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
- Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL
- Ambiguous column resolution: joins with same column name in both tables
- Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
- "stadium without concert" pattern: NOT IN with JOIN to get names
- Aggregate + GROUP BY + HAVING chains on multi-table joins
The hallucination pattern
The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.
Thinking Mode Observations (Run 8)
Benefits
- Reasons through SQL errors: "no such column: airport_code" → `<think>` block → tries `AirportCode`
- Empty `<think></think>` on easy questions — token-efficient, emergent behavior
- Multi-step join planning in think blocks
New failure mode
~23% of rollouts: `<think>assistant<think>assistant...` repeating until the token limit. The model fails to close the `</think>` tag and burns the entire token budget on garbage.
OOM risk
Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.
Narrative Threads for Blog
"The environment is the product"
From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.
"Small model showing improvement proves more than large model with marginal gains"
A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.
"Analysts don't write perfect queries from scratch"
The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.
"Dense rewards need theory"
Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.
"Multi-turn SFT teaches strategy, not actions"
The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.
References for Blog
- Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
- DeepSeek-AI (2025). DeepSeek-R1.
- Shao et al. (2024). DeepSeek-Math: GRPO.
- Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
- Yu et al. (2018). Spider dataset.
- Li et al. (2023). BIRD benchmark.
- TIPS (2026). Turn-Level Information-Potential Reward Shaping.
- ToolRL (2025). Reward is All Tool Learning Needs.
- StepTool (2024). Step-grained RL for Tool Learning.
Showcase Notebook Transcripts (for blog)
Random agent episode (seed=7) — comedic failure
Question: "Count the number of paragraphs."
SAMPLE Paragraphs → reward=0.015
SAMPLE Documents → reward=0.015
DESCRIBE Documents → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Templates → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
QUERY SELECT * FROM "Templates" LIMIT 5 → reward=0.0625
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Paragraphs → reward=0.015
QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
QUERY SELECT * FROM "Documents" LIMIT 5 → reward=0.025
ANSWER 76 | 20 | Robbin CV | y | None → reward=0.000 (incorrect)
Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote SELECT COUNT(*).
Oracle agent episode (seed=0) — clean solve
Question: "List the id of students who registered some courses and the number of their registered courses?"
Step 1: DESCRIBE student_course_registrations
→ student_id INTEGER, course_id INTEGER, registration_date DATETIME
→ reward: +0.015
Step 2: DESCRIBE students
→ student_id INTEGER, student_details VARCHAR(255)
→ reward: +0.015
Step 3: QUERY
SELECT T1.student_id, count(*)
FROM students AS T1
JOIN student_course_registrations AS T2
ON T1.student_id = T2.student_id
GROUP BY T1.student_id
→ 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
→ reward: +0.150
Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
→ correct
→ reward: +1.000
Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.
Baseline comparison (50 episodes each)
| Policy | Success Rate | Avg Reward | Avg Steps |
|---|---|---|---|
| Random | 0.0% | 0.247 | 15.0 |
| Oracle | 100.0% | 1.168 | 3.5 |
The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.
Reward constants (from server/reward.py)
+0.02 successful execution (no errors)
+0.01 new information (unique query)
-0.02 step cost (every action)
-0.03 repeat penalty (duplicate SQL)
[-0.10, +0.15] per-step clipping range
+1.0 correct answer (terminal)
+0.0 wrong answer (terminal)
Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.
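A quick arithmetic check on that claim:

```python
# Terminal dominance arithmetic: best-case exploration over the full budget
# vs the single correct-answer reward (constants from this section).
best_step = 0.02                     # exec_ok every step, best case
budget = 15
max_exploration = best_step * budget

assert abs(max_exploration - 0.3) < 1e-9
assert 1.0 > 3 * max_exploration     # terminal reward dominates with margin
```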
Competition Context
OpenEnv Challenge (our target)
- Sponsors: PyTorch/Meta, HuggingFace, Unsloth
- Prize: $10K HF credits
- Judging: primarily blog-based
- Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper
- Green Agent wrapper is an explicit judging criterion in the OpenEnv Challenge.
Deliverables
- Environment on HF Hub — live at https://huggingface.co/spaces/hjerpe/sql_env (pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`)
- Training notebooks/scripts on GitHub — `notebooks/train_grpo.ipynb`, `notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb`
- Blog on HuggingFace — `docs/blog-post-v1.md` (draft)
TRL integration status (already done — do not re-research)
`training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory` class: `reset()` plus named tool methods `describe()` / `sample()` / `query()` / `answer()` with docstrings TRL uses to build the tool schema. The notebook passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL, reward_funcs=[sql_env_reward_func])`. The adapter runs SQLEnvironment in-process (not as a WebSocket client to the HF Space) — intentional, because training opens N parallel sessions and the Space defaults to 1.
Competitive landscape
- SQL Repair (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
- Calendar Gym (Turing) — featured on HF blog, real-world framing + failure analysis
- OpenSec — cybersecurity with arXiv paper, adversarial evidence injection
- Our position: no interactive SQL exploration environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.
What winning entries do
- Stakes framing — "this matters in production"
- Concrete failure analysis with numbers
- Contrast (random vs trained vs oracle)
- Real data, not toy puzzles
- Non-obvious insights from training
Green Agent Evaluator
What it is
OpenEnv's standardized evaluation wrapper pattern. A Policy protocol with evaluate(env, policy, n_episodes, seed) that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.
Implementation
- `evaluation/policies.py` — `Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
- `evaluation/oracle_policy.py` — `OraclePolicy` baseline (runs gold SQL)
- `tests/test_evaluation.py` — 17 tests, all passing (unit + integration)
How it works
```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

# Run 50 episodes with random policy
result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")

# Run with trained policy (any class with select_action method)
result = evaluate(env, trained_policy, n_episodes=50, seed=42)
```
Where it's used
- `notebooks/showcase_sqlenv.ipynb` — Random vs Oracle baseline comparison
- `notebooks/compare_methods.ipynb` — all 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`
Key design choices
- Error isolation: one episode crashing doesn't kill the run — logged as `EpisodeResult(error=str(exc))`
- Deterministic seeding: `seed + episode_index` per episode for reproducibility
- Protocol-based: any class with `select_action(observation) -> action` works — no inheritance required
- Aggregate + per-episode: `EvaluationResult` has both summary metrics and a full `episodes` list for drill-down
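The protocol-based design can be illustrated with `typing.Protocol` (names follow the document; the exact signatures in `evaluation/policies.py` may differ):

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class Policy(Protocol):
    """Anything with select_action(observation) -> action is a policy."""
    def select_action(self, observation: Any) -> Any: ...

class AlwaysDescribe:
    """Toy policy: satisfies the protocol by duck typing, no inheritance."""
    def select_action(self, observation: Any) -> Any:
        return {"name": "describe", "arguments": {}}

# runtime_checkable makes isinstance() check for the method's presence
assert isinstance(AlwaysDescribe(), Policy)
```

This is why `LLMToolCallingPolicy`, `RandomPolicy`, and `OraclePolicy` can all flow through the same `evaluate()` harness without sharing a base class.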
For the blog
The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through evaluate(). The trained GRPO model is wrapped in LLMToolCallingPolicy (which implements the Policy protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.
Files to Reference
| File | Relevance |
|---|---|
| `docs/blog-outline.md` | Section structure template |
| `docs/blog-post.md` | Current draft |
| `docs/design-docs/reward-shaping-research.md` | Reward theory + references |
| `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed |
| `vision/VISION.md` | Product vision, success metrics |
| `training/trl_adapter.py` | Environment adapter code |
| `notebooks/compare_methods.ipynb` | Eval notebook |
| `notebooks/train_grpo.ipynb` | Training notebook |