
Blog Material — Raw Knowledge Dump

Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at docs/blog-outline.md, the draft at docs/blog-post.md.

Training Run Summary

Run progression (what each run taught us)

  1. Run 1: SFT works, GRPO plateaus — no penalty for post-episode waste
  2. Run 2: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
  3. Run 3: Without a KL penalty, GRPO lets structural tokens drift (<tool_response> instead of <tool_call>)
  4. Run 4: KL penalty + reference model = OOM on L4
  5. Run 5: KL too conservative with single-turn SFT — model only calls describe, never queries
  6. Run 6: Multi-turn SFT breakthrough — first successful training, reward -0.1→0.7
  7. Run 7: Repeat penalty, stable training, multi-table weakness exposed
  8. Run 8: Thinking mode helps error recovery, introduces <think>assistant degenerate loop, OOM crash
  9. Run 9: v2 continued training confirms ceiling — more epochs don't help medium questions

Key numbers

| Metric | Value |
| --- | --- |
| Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
| Hardware | Colab L4 (24GB VRAM) |
| SFT data | 120 multi-turn trajectories, assistant_only_loss |
| Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
| Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
| Total training | ~7.3h on L4 for both phases |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
| Parse rate | >98% (valid tool-call JSON) |
| Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
| Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
| Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |

Concrete Episode Examples (for blog)

Success: Simple aggregation (3 tool calls)

Question: "What is the total bonus given in all evaluations?"

  1. describe("evaluation") → Employee_ID, Year_awarded, Bonus columns
  2. query("SELECT SUM(Bonus) FROM evaluation") → 19500.0
  3. answer("19500.0") → correct. Reward: 1.15

Success: Error recovery with JOIN

Question: "Which employee received the biggest bonus?"

  1. describe("employee") → Employee_ID, Name, Age, City
  2. query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1") → Error: no such column: Salary
  3. describe("evaluation") → Employee_ID, Year_awarded, Bonus
  4. query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1") → Louis Deacon
  5. answer("Louis Deacon") → correct. Reward: 1.13

Success: Subquery with comparison

Question: "How many cars have greater accelerate than the one with the most horsepower?"

  1. describe("cars_data") → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
  2. query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)") → 39
  3. answer("39") → correct. Reward: 1.15

Failure: Column name hallucination

Question: "What are all the makers and models?"

  1. describe("cars_data") → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
  2. query("SELECT Id, model FROM cars_data") → Error: no such column: model
  3. describe("model_list") → ModelId, Maker, Model
  4. query("SELECT Maker, Model FROM model_list") → correct results (truncated to 20 rows)
  5. answer("1 | amc\n2 | audi\n3 | bmw...") → incorrect (truncated, missing rows)

The model found the right data, but its answer was cut off by the 20-row display limit.

Failure: FK chain reasoning

Question: "Which city has most number of arriving flights?"

  1. describe("AIRPORTS") → City, AirportCode, AirportName, Country, CountryAbbrev
  2. query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1") → Albany | 4
  3. answer("Albany") → incorrect

The model counted airports per city, not flights per city. It needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.

Failure: Invented table

Question: "What are all details of students who registered but did not attend any course?"

  1. describe("Student_Course_Registrations") → student_id, course_id, registration_date
  2. query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...") → syntax error (missing space)
  3. describe("Course_Attendance") → Error: table not found. Available: Student_Course_Attendance

The model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.

Reward Architecture Details

Three-layer structure

L1 Operational (every step):
  +0.02  exec_ok
  +0.01  new_info (unique SQL hash)
  -0.03  repeat penalty
  -0.02  step cost

L2 Progress (QUERY only):
  Delta from previous binned progress × 0.15
  Binned to {0, 0.25, 0.5, 0.75, 1.0}

L3 Terminal (ANSWER only):
  +1.0 correct, 0.0 wrong

Per-step clip: [-0.10, 0.15]
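Put together, the three layers can be sketched as one step-reward function. This is a minimal sketch with hypothetical names; the real constants live in server/reward.py, and some transcript rewards differ slightly (e.g. 0.015 for a DESCRIBE). We also assume the per-step clip covers the shaping terms only, since the oracle transcript shows the terminal +1.0 arriving unclipped.

```python
def step_reward(action, exec_ok, is_new_sql, is_repeat,
                progress=0.0, prev_progress=0.0, correct=False):
    """Sketch of the three-layer reward; names and structure are illustrative."""
    # L1 operational shaping (every step)
    shaped = 0.0
    if exec_ok:
        shaped += 0.02
    if is_new_sql:
        shaped += 0.01
    if is_repeat:
        shaped -= 0.03
    shaped -= 0.02  # step cost

    # L2 progress delta (QUERY only), binned to {0, 0.25, 0.5, 0.75, 1.0}
    if action == "query":
        def binned(p):
            return max(b for b in (0.0, 0.25, 0.5, 0.75, 1.0) if b <= p)
        shaped += (binned(progress) - binned(prev_progress)) * 0.15

    # Per-step clip on the shaping terms (assumption: clip excludes L3,
    # since transcripts show the terminal +1.0 unclipped)
    shaped = max(-0.10, min(0.15, shaped))

    # L3 terminal (ANSWER only)
    if action == "answer":
        shaped += 1.0 if correct else 0.0
    return shaped
```

A successful novel query that advances binned progress from 0.25 to 0.5 nets 0.01 in L1 plus 0.0375 in L2, comfortably inside the clip range.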

Why potential-based shaping matters

  • Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves optimal policy
  • Our delta progress IS potential-based with γ=1
  • Cumulative caps are NOT potential-based (depend on trajectory history)
  • Without this guarantee, agents learn to farm exploration rewards
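The policy-invariance claim has a concrete telescoping consequence worth showing: with γ=1, the shaping return of any trajectory is Φ(s_N) - Φ(s_0), so detours earn nothing extra. A minimal numerical check (the Φ values are hypothetical progress potentials):

```python
# Potential-based shaping F(s, s') = phi(s') - phi(s) telescopes:
# the summed shaping reward of a trajectory depends only on its endpoints,
# not on which intermediate states the agent visits.
def shaping_return(phis):
    return sum(b - a for a, b in zip(phis, phis[1:]))

direct = [0.0, 1.0]                            # straight to the goal
wandering = [0.0, 0.25, 0.5, 0.25, 0.75, 1.0]  # detours and backtracking

assert shaping_return(direct) == shaping_return(wandering) == 1.0
```

A cumulative cap breaks this property: once the cap binds, the reward for a transition depends on how much shaping was already paid out earlier in the trajectory, which is exactly the history dependence the bullet above warns about.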

Anti-farming mechanisms

  • Hard budget (15 steps)
  • Step cost (-0.02)
  • Repeat penalty (-0.03)
  • Terminal dominance (1.0 vs ~0.3 max exploration)
  • Per-step clip [-0.10, 0.15]
  • Post-episode penalty (-0.3)

Eval Results (N=50, 2026-04-11)

Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)

| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
| --- | --- | --- | --- | --- | --- | --- |
| zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
| 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
| 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
| GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
| GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |

Previous run (Run A, same day, same seed)

| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
| --- | --- | --- | --- | --- | --- |
| zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
| 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
| 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
| GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
| GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |

Run-to-run variation (important for blog)

v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is within run-to-run variation (6-8pp swing). For the blog, report both as "28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.

The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.

Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).
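The fallback can be sketched as a thin wrapper around the parser (all names here are hypothetical; the real harness lives in notebooks/compare_methods.ipynb):

```python
def parse_or_noop(raw_output, parse_tool_call):
    """If the model's output is not a valid tool call, substitute a no-op
    DESCRIBE instead of ending the episode, so base and trained models get
    the same 15-step budget. Returns (action, parsed_ok). Names are
    illustrative, not the real eval API."""
    try:
        return parse_tool_call(raw_output), True
    except ValueError:
        # No-op fallback: a DESCRIBE that consumes the step but keeps
        # the episode alive (the failure is still logged as a parse fail)
        return {"name": "describe", "arguments": {"table": None}}, False
```

Under this scheme a base model that never emits valid JSON still "runs" 15 steps, which is why its average step count balloons while its parse rate stays low.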

Key observations from N=50 eval (with retry, 2 runs)

  1. ~30% accuracy for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
  2. Run-to-run variation is ~6-8pp — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
  3. Base model parse failure loop — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
  4. GRPO solves format compliance — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid <tool_call> JSON.
  5. GRPO failure mode is SQL quality, not format — episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
  6. Extra turns don't help base models — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
  7. 1-shot occasionally gets lucky — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.

v2 vs v1: similar accuracy, more parse failures — behavioral shift

Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:

  • v1 guesses more: v1 almost always calls answer(), even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
  • v2 gives up on hard questions: v2 produces "Task complete." output after multiple failed queries instead of calling answer(), producing parse failures. v2 learned that some questions are unsolvable.
  • Neither is clearly better: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.

The v2 parse failure pattern (from raw output):

[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE Population < ...
[PARSE FAIL] raw: Task complete.    ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete.    ← repeats until budget

Compare v1 on the same type of question:

[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE ...
[OK] ANSWER: European cities and their names are: 42    ← wrong, but at least calls answer()

This is a form of calibrated uncertainty — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.

For the blog narrative

The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.

Format Mismatch Discovery (F011)

The three differences between eval and training

  1. role:tool vs role:user — Qwen3 renders role:"tool" as <|im_start|>user\n<tool_response>...</tool_response>, role:"user" as <|im_start|>user\nplain text. Same role token, different content structure.
  2. Structured tool_calls vs raw text — Training uses {"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}, eval was using {"role":"assistant", "content":"<tool_call>...</tool_call>"}.
  3. No separator vs \n\n — TRL appends reset() return directly to user message. Eval had question\n\ntable_hint.
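Difference 2 is the subtlest, so a side-by-side sketch helps. The two assistant messages below carry the same logical tool call but render to different token sequences under Qwen3's chat template (the training-side shape follows the structure quoted above; the exact field layout beyond that is an assumption):

```python
import json

# Training-side: structured tool_calls field, rendered by the chat template
train_msg = {
    "role": "assistant",
    "tool_calls": [{
        "function": {
            "name": "describe",
            "arguments": json.dumps({"table": "employee"}),
        },
    }],
}

# Broken eval-side: the same call hand-rolled into plain content text
eval_msg = {
    "role": "assistant",
    "content": '<tool_call>\n{"name": "describe", "arguments": {"table": "employee"}}\n</tool_call>',
}
```

The template serializes train_msg through its tool-call branch, while eval_msg goes through the plain-content branch, so the checkpoint saw one token distribution in training and a different one at eval.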

Impact

Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint). After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.

Lesson

Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.

Multi-Turn SFT — Why It's Critical

Per-turn SFT (broken)

  • 347 examples, each one assistant turn
  • ~50% were describe calls
  • Model learned: "when asked a question, call describe"
  • With KL penalty, model stayed anchored to this single-action policy
  • Result: reward=0.00, all rollouts identical, advantage=0

Multi-turn SFT (working)

  • 120 examples, each a full describe→query→answer trajectory
  • assistant_only_loss via Qwen3 template patch ({% generation %} tags)
  • Model learned: the SEQUENCE describe→query→answer
  • With KL penalty, model explores within the multi-turn strategy
  • Result: reward climbs to 0.7 in Phase 1

Template patch detail

Qwen3's chat template lacks {% generation %} tags needed by TRL for assistant_only_loss. We patch the template before SFT, restore original before GRPO (TRL does exact-match checks on template string in add_response_schema() and get_training_chat_template()).
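A sketch of the patch-and-restore dance (the anchor string below is a placeholder, not Qwen3's actual template text; the real patch must match the template exactly):

```python
def patch_for_assistant_only_loss(template: str) -> str:
    """Wrap the assistant turn in {% generation %} tags so TRL can mask the
    loss to assistant tokens. The anchor string here is hypothetical; the
    real patch must target Qwen3's actual template text."""
    return template.replace(
        "{{ message.content }}",  # placeholder anchor, not the real template
        "{% generation %}{{ message.content }}{% endgeneration %}",
    )

# Usage sketch:
#   original = tokenizer.chat_template          # save before SFT
#   tokenizer.chat_template = patch_for_assistant_only_loss(original)
#   ... run SFT with assistant_only_loss ...
#   tokenizer.chat_template = original          # restore before GRPO,
#   since TRL exact-matches the template string
```

The restore step is the non-obvious part: leaving the patched template in place makes TRL's exact-match checks fail during GRPO.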

The 0.6B Capacity Ceiling

What works at 0.6B

  • Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
  • Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
  • WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
  • Answer formatting: comma lists, pipe-delimited rows, [] for empty
  • Error recovery: describe after SQL error, retry with correct column names
  • sample tool usage (learned in Run 6, inconsistent later)

What breaks at 0.6B

  • FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
  • Column name fidelity: reads FullName from describe, writes full_name in SQL
  • Ambiguous column resolution: joins with same column name in both tables
  • Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
  • "stadium without concert" pattern: NOT IN with JOIN to get names
  • Aggregate + GROUP BY + HAVING chains on multi-table joins

The hallucination pattern

The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.

Thinking Mode Observations (Run 8)

Benefits

  • Reasons through SQL errors: "no such column: airport_code" → <think> block → tries AirportCode
  • Empty <think></think> on easy questions — token-efficient, emergent behavior
  • Multi-step join planning in think blocks

New failure mode

~23% of rollouts: <think>assistant<think>assistant... repeating until token limit. Model fails to close </think> tag. Burns entire token budget with garbage.

OOM risk

Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.

Narrative Threads for Blog

"The environment is the product"

From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.

"Small model showing improvement proves more than large model with marginal gains"

A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.

"Analysts don't write perfect queries from scratch"

The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.

"Dense rewards need theory"

Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.

"Multi-turn SFT teaches strategy, not actions"

The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.

References for Blog

  • Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
  • DeepSeek-AI (2025). DeepSeek-R1.
  • Shao et al. (2024). DeepSeek-Math: GRPO.
  • Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
  • Yu et al. (2018). Spider dataset.
  • Li et al. (2023). BIRD benchmark.
  • TIPS (2026). Turn-Level Information-Potential Reward Shaping.
  • ToolRL (2025). Reward is All Tool Learning Needs.
  • StepTool (2024). Step-grained RL for Tool Learning.

Showcase Notebook Transcripts (for blog)

Random agent episode (seed=7) — comedic failure

Question: "Count the number of paragraphs."

  SAMPLE Paragraphs         → reward=0.015
  SAMPLE Documents           → reward=0.015
  DESCRIBE Documents         → reward=0.015
  SAMPLE Documents           → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Templates         → reward=0.015
  SAMPLE Documents           → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  QUERY SELECT * FROM "Templates" LIMIT 5  → reward=0.0625
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Paragraphs        → reward=0.015
  QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
  QUERY SELECT * FROM "Documents" LIMIT 5  → reward=0.025
  ANSWER 76 | 20 | Robbin CV | y | None   → reward=0.000 (incorrect)

Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote SELECT COUNT(*).

Oracle agent episode (seed=0) — clean solve

Question: "List the id of students who registered some courses and the number of their registered courses?"

  Step 1: DESCRIBE student_course_registrations
    → student_id INTEGER, course_id INTEGER, registration_date DATETIME
    → reward: +0.015

  Step 2: DESCRIBE students
    → student_id INTEGER, student_details VARCHAR(255)
    → reward: +0.015

  Step 3: QUERY
    SELECT T1.student_id, count(*)
    FROM students AS T1
    JOIN student_course_registrations AS T2
      ON T1.student_id = T2.student_id
    GROUP BY T1.student_id
    → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
    → reward: +0.150

  Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
    → correct
    → reward: +1.000

Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.

Baseline comparison (50 episodes each)

| Policy | Success Rate | Avg Reward | Avg Steps |
| --- | --- | --- | --- |
| Random | 0.0% | 0.247 | 15.0 |
| Oracle | 100.0% | 1.168 | 3.5 |

The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.

Reward constants (from server/reward.py)

+0.02   successful execution (no errors)
+0.01   new information (unique query)
-0.02   step cost (every action)
-0.03   repeat penalty (duplicate SQL)
[-0.10, +0.15]  per-step clipping range
+1.0    correct answer (terminal)
+0.0    wrong answer (terminal)

Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.

Competition Context

OpenEnv Challenge (our target)

  • Sponsors: PyTorch/Meta, HuggingFace, Unsloth
  • Prize: $10K HF credits
  • Judging: primarily blog-based
  • Criteria: creative OpenEnv use, technical excellence, storytelling, open-source demo, and a Green Agent wrapper (an explicit judging criterion)

Deliverables

  1. Environment on HF Hub — live at https://huggingface.co/spaces/hjerpe/sql_env (pushed 2026-03-29; Docker image at registry.hf.space/hjerpe-sql_env:latest)
  2. Training notebooks/scripts on GitHub — notebooks/train_grpo.ipynb, notebooks/compare_methods.ipynb, notebooks/showcase_sqlenv.ipynb
  3. Blog on HuggingFace — docs/blog-post-v1.md (draft)

TRL integration status (already done — do not re-research)

training/trl_adapter.py::SQLEnvTRL is a TRL-native environment_factory class: reset() + named tool methods describe() / sample() / query() / answer() with docstrings TRL uses to build the tool schema. The notebook passes it directly: GRPOTrainer(..., environment_factory=SQLEnvTRL, reward_funcs=[sql_env_reward_func]). The adapter runs SQLEnvironment in-process (not a WebSocket client to the HF Space) — intentional, because training opens N parallel sessions and the Space defaults to 1.

Competitive landscape

  • SQL Repair (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
  • Calendar Gym (Turing) — featured on HF blog, real-world framing + failure analysis
  • OpenSec — cybersecurity with arXiv paper, adversarial evidence injection
  • Our position: no interactive SQL exploration environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.

What winning entries do

  1. Stakes framing — "this matters in production"
  2. Concrete failure analysis with numbers
  3. Contrast (random vs trained vs oracle)
  4. Real data, not toy puzzles
  5. Non-obvious insights from training

Green Agent Evaluator

What it is

OpenEnv's standardized evaluation wrapper pattern. A Policy protocol with evaluate(env, policy, n_episodes, seed) that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.

Implementation

  • evaluation/policies.py — Policy protocol, evaluate() harness, RandomPolicy, EpisodeResult, EvaluationResult
  • evaluation/oracle_policy.py — OraclePolicy baseline (runs gold SQL)
  • tests/test_evaluation.py — 17 tests, all passing (unit + integration)

How it works

  from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

  # Run 50 episodes with random policy
  result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
  print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")

  # Run with trained policy (any class with select_action method)
  result = evaluate(env, trained_policy, n_episodes=50, seed=42)

Where it's used

  • notebooks/showcase_sqlenv.ipynb — Random vs Oracle baseline comparison
  • notebooks/compare_methods.ipynb — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through evaluate()

Key design choices

  • Error isolation: one episode crashing doesn't kill the run — logged as EpisodeResult(error=str(exc))
  • Deterministic seeding: seed + episode_index per episode for reproducibility
  • Protocol-based: any class with select_action(observation) -> action works — no inheritance required
  • Aggregate + per-episode: EvaluationResult has both summary metrics and full episodes list for drill-down
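A minimal sketch of the shapes described above (only Policy, select_action, EpisodeResult, EvaluationResult, success_rate/avg_reward, and the error field come from this note; every other detail is an assumption):

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Protocol


class Policy(Protocol):
    """Structural type: any class with select_action() qualifies,
    no inheritance required."""
    def select_action(self, observation: Any) -> Any: ...


@dataclass
class EpisodeResult:
    success: bool = False
    total_reward: float = 0.0
    steps: int = 0
    error: Optional[str] = None  # set when an episode crashes (error isolation)


@dataclass
class EvaluationResult:
    episodes: List[EpisodeResult] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        return sum(e.success for e in self.episodes) / len(self.episodes)

    @property
    def avg_reward(self) -> float:
        return sum(e.total_reward for e in self.episodes) / len(self.episodes)
```

Because Policy is a typing.Protocol, RandomPolicy, OraclePolicy, and the LLM-backed policy all pass type checks without sharing a base class, which is the "protocol-based" design choice listed above.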

For the blog

The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through evaluate(). The trained GRPO model is wrapped in LLMToolCallingPolicy (which implements the Policy protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.

Files to Reference

| File | Relevance |
| --- | --- |
| docs/blog-outline.md | Section structure template |
| docs/blog-post.md | Current draft |
| docs/design-docs/reward-shaping-research.md | Reward theory + references |
| docs/exploration/grpo-training-session-log.md | All 9 runs detailed |
| vision/VISION.md | Product vision, success metrics |
| training/trl_adapter.py | Environment adapter code |
| notebooks/compare_methods.ipynb | Eval notebook |
| notebooks/train_grpo.ipynb | Training notebook |