
Blog Material — Raw Knowledge Dump

Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at docs/blog-outline.md, the draft at docs/blog-post.md.

Training Run Summary

Run progression (what each run taught us)

  1. Run 1: SFT works, GRPO plateaus — no penalty for post-episode waste
  2. Run 2: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
  3. Run 3: Without a KL penalty, GRPO lets structural tokens drift (<tool_response> instead of <tool_call>)
  4. Run 4: KL penalty + reference model = OOM on L4
  5. Run 5: KL too conservative with single-turn SFT — model only calls describe, never queries
  6. Run 6: Multi-turn SFT breakthrough — first successful training, reward -0.1→0.7
  7. Run 7: Repeat penalty, stable training, multi-table weakness exposed
  8. Run 8: Thinking mode helps error recovery, introduces <think>assistant degenerate loop, OOM crash
  9. Run 9: v2 continued training confirms ceiling — more epochs don't help medium questions

Key numbers

| Metric | Value |
| --- | --- |
| Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
| Hardware | Colab L4 (24GB VRAM) |
| SFT data | 120 multi-turn trajectories, assistant_only_loss |
| Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
| Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
| Total training | ~7.3h on L4 for both phases |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
| Parse rate | >98% (valid tool-call JSON) |
| Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
| Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
| Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |

Concrete Episode Examples (for blog)

Success: Simple aggregation (3 tool calls)

Question: "What is the total bonus given in all evaluations?"

  1. describe("evaluation") → Employee_ID, Year_awarded, Bonus columns
  2. query("SELECT SUM(Bonus) FROM evaluation") → 19500.0
  3. answer("19500.0") → correct. Reward: 1.15

Success: Error recovery with JOIN

Question: "Which employee received the biggest bonus?"

  1. describe("employee") → Employee_ID, Name, Age, City
  2. query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1") → Error: no such column: Salary
  3. describe("evaluation") → Employee_ID, Year_awarded, Bonus
  4. query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1") → Louis Deacon
  5. answer("Louis Deacon") → correct. Reward: 1.13

Success: Subquery with comparison

Question: "How many cars have greater accelerate than the one with the most horsepower?"

  1. describe("cars_data") → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
  2. query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)") → 39
  3. answer("39") → correct. Reward: 1.15

Failure: Column name hallucination

Question: "What are all the makers and models?"

  1. describe("cars_data") → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
  2. query("SELECT Id, model FROM cars_data") → Error: no such column: model
  3. describe("model_list") → ModelId, Maker, Model
  4. query("SELECT Maker, Model FROM model_list") → correct results (truncated to 20 rows)
  5. answer("1 | amc\n2 | audi\n3 | bmw...") → incorrect (truncated, missing rows)

The model found the right data, but its answer was cut off by the 20-row display limit.

Failure: FK chain reasoning

Question: "Which city has most number of arriving flights?"

  1. describe("AIRPORTS") → City, AirportCode, AirportName, Country, CountryAbbrev
  2. query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1") → Albany | 4
  3. answer("Albany") → incorrect

The model counted airports per city, not flights per city. It needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.

Failure: Invented table

Question: "What are all details of students who registered but did not attend any course?"

  1. describe("Student_Course_Registrations") → student_id, course_id, registration_date
  2. query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...") → syntax error (missing space)
  3. describe("Course_Attendance") → Error: table not found. Available: Student_Course_Attendance

The model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.

Reward Architecture Details

Three-layer structure

L1 Operational (every step):
  +0.02  exec_ok
  +0.01  new_info (unique SQL hash)
  -0.03  repeat penalty
  -0.02  step cost

L2 Progress (QUERY only):
  Delta from previous binned progress × 0.15
  Binned to {0, 0.25, 0.5, 0.75, 1.0}

L3 Terminal (ANSWER only):
  +1.0 correct, 0.0 wrong

Per-step clip: [-0.10, 0.15]
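Put together, the three layers can be sketched as one step-reward function. This is a minimal sketch with hypothetical names; the real constants live in server/reward.py, and some transcript rewards differ slightly (e.g. 0.015 for a DESCRIBE). We also assume the per-step clip covers the shaping terms only, since the oracle transcript shows the terminal +1.0 arriving unclipped.

```python
def step_reward(action, exec_ok, is_new_sql, is_repeat,
                progress=0.0, prev_progress=0.0, correct=False):
    """Sketch of the three-layer reward; names and structure are illustrative."""
    # L1 operational shaping (every step)
    shaped = 0.0
    if exec_ok:
        shaped += 0.02
    if is_new_sql:
        shaped += 0.01
    if is_repeat:
        shaped -= 0.03
    shaped -= 0.02  # step cost

    # L2 progress delta (QUERY only), binned to {0, 0.25, 0.5, 0.75, 1.0}
    if action == "query":
        def binned(p):
            return max(b for b in (0.0, 0.25, 0.5, 0.75, 1.0) if b <= p)
        shaped += (binned(progress) - binned(prev_progress)) * 0.15

    # Per-step clip on the shaping terms (assumption: clip excludes L3,
    # since transcripts show the terminal +1.0 unclipped)
    shaped = max(-0.10, min(0.15, shaped))

    # L3 terminal (ANSWER only)
    if action == "answer":
        shaped += 1.0 if correct else 0.0
    return shaped
```

A successful novel query that advances binned progress from 0.25 to 0.5 nets 0.01 in L1 plus 0.0375 in L2, comfortably inside the clip range.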

Why potential-based shaping matters

  • Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves optimal policy
  • Our delta progress IS potential-based with γ=1
  • Cumulative caps are NOT potential-based (depend on trajectory history)
  • Without this guarantee, agents learn to farm exploration rewards
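The policy-invariance claim has a concrete telescoping consequence worth showing: with γ=1, the shaping return of any trajectory is Φ(s_N) - Φ(s_0), so detours earn nothing extra. A minimal numerical check (the Φ values are hypothetical progress potentials):

```python
# Potential-based shaping F(s, s') = phi(s') - phi(s) telescopes:
# the summed shaping reward of a trajectory depends only on its endpoints,
# not on which intermediate states the agent visits.
def shaping_return(phis):
    return sum(b - a for a, b in zip(phis, phis[1:]))

direct = [0.0, 1.0]                            # straight to the goal
wandering = [0.0, 0.25, 0.5, 0.25, 0.75, 1.0]  # detours and backtracking

assert shaping_return(direct) == shaping_return(wandering) == 1.0
```

A cumulative cap breaks this property: once the cap binds, the reward for a transition depends on how much shaping was already paid out earlier in the trajectory, which is exactly the history dependence the bullet above warns about.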

Anti-farming mechanisms

  • Hard budget (15 steps)
  • Step cost (-0.02)
  • Repeat penalty (-0.03)
  • Terminal dominance (1.0 vs ~0.3 max exploration)
  • Per-step clip [-0.10, 0.15]
  • Post-episode penalty (-0.3)

Eval Results (N=50, 2026-04-11)

Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)

| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
| --- | --- | --- | --- | --- | --- | --- |
| zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
| 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
| 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
| GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
| GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |

Previous run (Run A, same day, same seed)

| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
| --- | --- | --- | --- | --- | --- |
| zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
| 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
| 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
| GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
| GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |

Run-to-run variation (important for blog)

v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is within run-to-run variation (6-8pp swing). For the blog, report both as "28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.

The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.

Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).
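The fallback can be sketched as a thin wrapper around the parser (all names here are hypothetical; the real harness lives in notebooks/compare_methods.ipynb):

```python
def parse_or_noop(raw_output, parse_tool_call):
    """If the model's output is not a valid tool call, substitute a no-op
    DESCRIBE instead of ending the episode, so base and trained models get
    the same 15-step budget. Returns (action, parsed_ok). Names are
    illustrative, not the real eval API."""
    try:
        return parse_tool_call(raw_output), True
    except ValueError:
        # No-op fallback: a DESCRIBE that consumes the step but keeps
        # the episode alive (the failure is still logged as a parse fail)
        return {"name": "describe", "arguments": {"table": None}}, False
```

Under this scheme a base model that never emits valid JSON still "runs" 15 steps, which is why its average step count balloons while its parse rate stays low.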

Key observations from N=50 eval (with retry, 2 runs)

  1. ~30% accuracy for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
  2. Run-to-run variation is ~6-8pp — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
  3. Base model parse failure loop — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
  4. GRPO solves format compliance — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid <tool_call> JSON.
  5. GRPO failure mode is SQL quality, not format — episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
  6. Extra turns don't help base models — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
  7. 1-shot occasionally gets lucky — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.

v2 vs v1: similar accuracy, more parse failures — behavioral shift

Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:

  • v1 guesses more: v1 almost always calls answer(), even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
  • v2 gives up on hard questions: v2 produces "Task complete." output after multiple failed queries instead of calling answer(), producing parse failures. v2 learned that some questions are unsolvable.
  • Neither is clearly better: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.

The v2 parse failure pattern (from raw output):

[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE Population < ...
[PARSE FAIL] raw: Task complete.    ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete.    ← repeats until budget

Compare v1 on the same type of question:

[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE ...
[OK] ANSWER: European cities and their names are: 42    ← wrong, but at least calls answer()

This is a form of calibrated uncertainty — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.

For the blog narrative

The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.

Format Mismatch Discovery (F011)

The three differences between eval and training

  1. role:tool vs role:user — Qwen3 renders role:"tool" as <|im_start|>user\n<tool_response>...</tool_response>, role:"user" as <|im_start|>user\nplain text. Same role token, different content structure.
  2. Structured tool_calls vs raw text — Training uses {"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}, eval was using {"role":"assistant", "content":"<tool_call>...</tool_call>"}.
  3. No separator vs \n\n — TRL appends reset() return directly to user message. Eval had question\n\ntable_hint.
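Difference 2 is the subtlest, so a side-by-side sketch helps. The two assistant messages below carry the same logical tool call but render to different token sequences under Qwen3's chat template (the training-side shape follows the structure quoted above; the exact field layout beyond that is an assumption):

```python
import json

# Training-side: structured tool_calls field, rendered by the chat template
train_msg = {
    "role": "assistant",
    "tool_calls": [{
        "function": {
            "name": "describe",
            "arguments": json.dumps({"table": "employee"}),
        },
    }],
}

# Broken eval-side: the same call hand-rolled into plain content text
eval_msg = {
    "role": "assistant",
    "content": '<tool_call>\n{"name": "describe", "arguments": {"table": "employee"}}\n</tool_call>',
}
```

The template serializes train_msg through its tool-call branch, while eval_msg goes through the plain-content branch, so the checkpoint saw one token distribution in training and a different one at eval.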

Impact

Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint). After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.

Lesson

Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.

Multi-Turn SFT — Why It's Critical

Per-turn SFT (broken)

  • 347 examples, each one assistant turn
  • ~50% were describe calls
  • Model learned: "when asked a question, call describe"
  • With KL penalty, model stayed anchored to this single-action policy
  • Result: reward=0.00, all rollouts identical, advantage=0

Multi-turn SFT (working)

  • 120 examples, each a full describe→query→answer trajectory
  • assistant_only_loss via Qwen3 template patch ({% generation %} tags)
  • Model learned: the SEQUENCE describe→query→answer
  • With KL penalty, model explores within the multi-turn strategy
  • Result: reward climbs to 0.7 in Phase 1

Template patch detail

Qwen3's chat template lacks {% generation %} tags needed by TRL for assistant_only_loss. We patch the template before SFT, restore original before GRPO (TRL does exact-match checks on template string in add_response_schema() and get_training_chat_template()).
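A sketch of the patch-and-restore dance (the anchor string below is a placeholder, not Qwen3's actual template text; the real patch must match the template exactly):

```python
def patch_for_assistant_only_loss(template: str) -> str:
    """Wrap the assistant turn in {% generation %} tags so TRL can mask the
    loss to assistant tokens. The anchor string here is hypothetical; the
    real patch must target Qwen3's actual template text."""
    return template.replace(
        "{{ message.content }}",  # placeholder anchor, not the real template
        "{% generation %}{{ message.content }}{% endgeneration %}",
    )

# Usage sketch:
#   original = tokenizer.chat_template          # save before SFT
#   tokenizer.chat_template = patch_for_assistant_only_loss(original)
#   ... run SFT with assistant_only_loss ...
#   tokenizer.chat_template = original          # restore before GRPO,
#   since TRL exact-matches the template string
```

The restore step is the non-obvious part: leaving the patched template in place makes TRL's exact-match checks fail during GRPO.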

The 0.6B Capacity Ceiling

What works at 0.6B

  • Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
  • Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
  • WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
  • Answer formatting: comma lists, pipe-delimited rows, [] for empty
  • Error recovery: describe after SQL error, retry with correct column names
  • sample tool usage (learned in Run 6, inconsistent later)

What breaks at 0.6B

  • FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
  • Column name fidelity: reads FullName from describe, writes full_name in SQL
  • Ambiguous column resolution: joins with same column name in both tables
  • Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
  • "stadium without concert" pattern: NOT IN with JOIN to get names
  • Aggregate + GROUP BY + HAVING chains on multi-table joins

The hallucination pattern

The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.

Thinking Mode Observations (Run 8)

Benefits

  • Reasons through SQL errors: "no such column: airport_code" → <think> block → tries AirportCode
  • Empty <think></think> on easy questions — token-efficient, emergent behavior
  • Multi-step join planning in think blocks

New failure mode

~23% of rollouts: <think>assistant<think>assistant... repeating until token limit. Model fails to close </think> tag. Burns entire token budget with garbage.

OOM risk

Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.

Narrative Threads for Blog

"The environment is the product"

From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.

"Small model showing improvement proves more than large model with marginal gains"

A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.

"Analysts don't write perfect queries from scratch"

The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.

"Dense rewards need theory"

Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.

"Multi-turn SFT teaches strategy, not actions"

The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.

References for Blog

  • Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
  • DeepSeek-AI (2025). DeepSeek-R1.
  • Shao et al. (2024). DeepSeek-Math: GRPO.
  • Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
  • Yu et al. (2018). Spider dataset.
  • Li et al. (2023). BIRD benchmark.
  • TIPS (2026). Turn-Level Information-Potential Reward Shaping.
  • ToolRL (2025). Reward is All Tool Learning Needs.
  • StepTool (2024). Step-grained RL for Tool Learning.

Showcase Notebook Transcripts (for blog)

Random agent episode (seed=7) — comedic failure

Question: "Count the number of paragraphs."

  SAMPLE Paragraphs         → reward=0.015
  SAMPLE Documents           → reward=0.015
  DESCRIBE Documents         → reward=0.015
  SAMPLE Documents           → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Templates         → reward=0.015
  SAMPLE Documents           → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  QUERY SELECT * FROM "Templates" LIMIT 5  → reward=0.0625
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Paragraphs        → reward=0.015
  QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
  QUERY SELECT * FROM "Documents" LIMIT 5  → reward=0.025
  ANSWER 76 | 20 | Robbin CV | y | None   → reward=0.000 (incorrect)

Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote SELECT COUNT(*).

Oracle agent episode (seed=0) — clean solve

Question: "List the id of students who registered some courses and the number of their registered courses?"

  Step 1: DESCRIBE student_course_registrations
    → student_id INTEGER, course_id INTEGER, registration_date DATETIME
    → reward: +0.015

  Step 2: DESCRIBE students
    → student_id INTEGER, student_details VARCHAR(255)
    → reward: +0.015

  Step 3: QUERY
    SELECT T1.student_id, count(*)
    FROM students AS T1
    JOIN student_course_registrations AS T2
      ON T1.student_id = T2.student_id
    GROUP BY T1.student_id
    → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
    → reward: +0.150

  Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
    → correct
    → reward: +1.000

Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.

Baseline comparison (50 episodes each)

| Policy | Success Rate | Avg Reward | Avg Steps |
| --- | --- | --- | --- |
| Random | 0.0% | 0.247 | 15.0 |
| Oracle | 100.0% | 1.168 | 3.5 |

The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.

Reward constants (from server/reward.py)

+0.02   successful execution (no errors)
+0.01   new information (unique query)
-0.02   step cost (every action)
-0.03   repeat penalty (duplicate SQL)
[-0.10, +0.15]  per-step clipping range
+1.0    correct answer (terminal)
+0.0    wrong answer (terminal)

Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.

Competition Context

OpenEnv Challenge (our target)

  • Sponsors: PyTorch/Meta, HuggingFace, Unsloth
  • Prize: $10K HF credits
  • Judging: primarily blog-based
  • Criteria: creative OpenEnv use, technical excellence, storytelling, open-source demo, and a Green Agent wrapper (an explicit judging criterion)

Deliverables

  1. Environment on HF Hub — live at https://huggingface.co/spaces/hjerpe/sql_env (pushed 2026-03-29; Docker image at registry.hf.space/hjerpe-sql_env:latest)
  2. Training notebooks/scripts on GitHub — notebooks/train_grpo.ipynb, notebooks/compare_methods.ipynb, notebooks/showcase_sqlenv.ipynb
  3. Blog on HuggingFace — docs/blog-post-v1.md (draft)

TRL integration status (already done — do not re-research)

training/trl_adapter.py::SQLEnvTRL is a TRL-native environment_factory class: reset() + named tool methods describe() / sample() / query() / answer() with docstrings TRL uses to build the tool schema. The notebook passes it directly: GRPOTrainer(..., environment_factory=SQLEnvTRL, reward_funcs=[sql_env_reward_func]). The adapter runs SQLEnvironment in-process (not a WebSocket client to the HF Space) — intentional, because training opens N parallel sessions and the Space defaults to 1.

Competitive landscape

  • SQL Repair (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
  • Calendar Gym (Turing) — featured on HF blog, real-world framing + failure analysis
  • OpenSec — cybersecurity with arXiv paper, adversarial evidence injection
  • Our position: no interactive SQL exploration environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.

What winning entries do

  1. Stakes framing — "this matters in production"
  2. Concrete failure analysis with numbers
  3. Contrast (random vs trained vs oracle)
  4. Real data, not toy puzzles
  5. Non-obvious insights from training

Green Agent Evaluator

What it is

OpenEnv's standardized evaluation wrapper pattern. A Policy protocol with evaluate(env, policy, n_episodes, seed) that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.

Implementation

  • evaluation/policies.py — Policy protocol, evaluate() harness, RandomPolicy, EpisodeResult, EvaluationResult
  • evaluation/oracle_policy.py — OraclePolicy baseline (runs gold SQL)
  • tests/test_evaluation.py — 17 tests, all passing (unit + integration)

How it works

  from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

  # Run 50 episodes with random policy
  result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
  print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")

  # Run with trained policy (any class with select_action method)
  result = evaluate(env, trained_policy, n_episodes=50, seed=42)

Where it's used

  • notebooks/showcase_sqlenv.ipynb — Random vs Oracle baseline comparison
  • notebooks/compare_methods.ipynb — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through evaluate()

Key design choices

  • Error isolation: one episode crashing doesn't kill the run — logged as EpisodeResult(error=str(exc))
  • Deterministic seeding: seed + episode_index per episode for reproducibility
  • Protocol-based: any class with select_action(observation) -> action works — no inheritance required
  • Aggregate + per-episode: EvaluationResult has both summary metrics and full episodes list for drill-down
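A minimal sketch of the shapes described above (only Policy, select_action, EpisodeResult, EvaluationResult, success_rate/avg_reward, and the error field come from this note; every other detail is an assumption):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Protocol


class Policy(Protocol):
    """Structural type: any class with select_action() qualifies,
    no inheritance required."""
    def select_action(self, observation: Any) -> Any: ...


@dataclass
class EpisodeResult:
    success: bool = False
    total_reward: float = 0.0
    steps: int = 0
    error: Optional[str] = None  # set when an episode crashes (error isolation)


@dataclass
class EvaluationResult:
    episodes: List[EpisodeResult] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        return sum(e.success for e in self.episodes) / len(self.episodes)

    @property
    def avg_reward(self) -> float:
        return sum(e.total_reward for e in self.episodes) / len(self.episodes)
```

Because Policy is a typing.Protocol, RandomPolicy, OraclePolicy, and the LLM-backed policy all pass type checks without sharing a base class, which is the "protocol-based" design choice listed above.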

For the blog

The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through evaluate(). The trained GRPO model is wrapped in LLMToolCallingPolicy (which implements the Policy protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.

Files to Reference

| File | Relevance |
| --- | --- |
| docs/blog-outline.md | Section structure template |
| docs/blog-post.md | Current draft |
| docs/design-docs/reward-shaping-research.md | Reward theory + references |
| docs/exploration/grpo-training-session-log.md | All 9 runs detailed |
| vision/VISION.md | Product vision, success metrics |
| training/trl_adapter.py | Environment adapter code |
| notebooks/compare_methods.ipynb | Eval notebook |
| notebooks/train_grpo.ipynb | Training notebook |