# Blog Material — Raw Knowledge Dump

Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at `docs/blog-outline.md`, the draft at `docs/blog-post.md`.

## Training Run Summary

### Run progression (what each run taught us)

1. **Run 1**: SFT works, GRPO plateaus — no penalty for post-episode waste
2. **Run 2**: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
3. **Run 3**: Without KL penalty, GRPO drifts structural tokens (`` instead of ``)
4. **Run 4**: KL penalty + reference model = OOM on L4
5. **Run 5**: KL too conservative with single-turn SFT — model only calls describe, never queries
6. **Run 6**: Multi-turn SFT breakthrough — first successful training, reward -0.1→0.7
7. **Run 7**: Repeat penalty, stable training, multi-table weakness exposed
8. **Run 8**: Thinking mode helps error recovery, introduces `assistant` degenerate loop, OOM crash
9. **Run 9**: v2 continued training confirms ceiling — more epochs don't help medium questions

### Key numbers

| Metric | Value |
|--------|-------|
| Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
| Hardware | Colab L4 (24GB VRAM) |
| SFT data | 120 multi-turn trajectories, assistant_only_loss |
| Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
| Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
| Total training | ~7.3h on L4 for both phases |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
| Parse rate | >98% (valid tool-call JSON) |
| Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
| Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
| Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |

## Concrete Episode Examples (for blog)

### Success: Simple aggregation (3 tool calls)

Question: "What is the total bonus given in all evaluations?"

1. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
2. `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
3. `answer("19500.0")` → correct

Reward: 1.15

### Success: Error recovery with JOIN

Question: "Which employee received the biggest bonus?"

1. `describe("employee")` → Employee_ID, Name, Age, City
2. `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
3. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
4. `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
5. `answer("Louis Deacon")` → correct

Reward: 1.13

### Success: Subquery with comparison

Question: "How many cars have greater accelerate than the one with the most horsepower?"

1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
2. `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
3. `answer("39")` → correct

Reward: 1.15

### Failure: Column name hallucination

Question: "What are all the makers and models?"

1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
2. `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
3. `describe("model_list")` → ModelId, Maker, Model
4. `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
5. `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)

Model found the right data but the answer was truncated by the 20-row display limit.

### Failure: FK chain reasoning

Question: "Which city has most number of arriving flights?"

1. `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
2. `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
3. `answer("Albany")` → incorrect

Model counted airports per city, not flights per city. Needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.

### Failure: Invented table

Question: "What are all details of students who registered but did not attend any course?"

1. `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
2. `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
3. `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance

Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.
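All of the episodes above follow the same outer loop: pick one of the four tools, execute it, collect the shaped reward, and stop on `answer()` or when the 15-step budget runs out. A minimal driver sketch — the `env.reset()`/`env.step()` signature here is a hypothetical stand-in, not SQLEnv's actual API:

```python
MAX_STEPS = 15  # hard per-episode budget


def run_episode(env, policy):
    """Roll out one episode; returns (total_reward, steps_used)."""
    obs = env.reset()  # question + list of available tables
    total, steps = 0.0, 0
    while steps < MAX_STEPS:
        action = policy.select_action(obs)  # one of describe/sample/query/answer
        obs, reward, done = env.step(action)
        total += reward
        steps += 1
        if done:  # answer() was submitted (correct or not)
            break
    return total, steps
```

The random-agent transcript later in this file is exactly this loop hitting the budget; the oracle transcript is the same loop terminating in 4 steps.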
## Reward Architecture Details

### Three-layer structure

```
L1 Operational (every step):
  +0.02 exec_ok
  +0.01 new_info (unique SQL hash)
  -0.03 repeat penalty
  -0.02 step cost

L2 Progress (QUERY only):
  Delta from previous binned progress × 0.15
  Binned to {0, 0.25, 0.5, 0.75, 1.0}

L3 Terminal (ANSWER only):
  +1.0 correct, 0.0 wrong

Per-step clip: [-0.10, 0.15]
```

### Why potential-based shaping matters

- Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves optimal policy
- Our delta progress IS potential-based with γ=1
- Cumulative caps are NOT potential-based (depend on trajectory history)
- Without this guarantee, agents learn to farm exploration rewards

### Anti-farming mechanisms

- Hard budget (15 steps)
- Step cost (-0.02)
- Repeat penalty (-0.03)
- Terminal dominance (1.0 vs ~0.3 max exploration)
- Per-step clip [-0.10, 0.15]
- Post-episode penalty (-0.3)

## Eval Results (N=50, 2026-04-11)

### Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)

| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
|--------|----------|------------|-----------|------------|-------------|----------------|
| zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
| 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
| 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
| GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
| GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |

### Previous run (Run A, same day, same seed)

| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
|--------|----------|------------|-----------|------------|----------------|
| zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
| 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
| 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
| GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
| GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |

### Run-to-run variation (important for blog)

v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is **within run-to-run variation** (~6-8pp swing). For the blog, report both as "~28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.

The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.

Note: parse failures no longer end episodes — the model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).

### Key observations from N=50 eval (with retry, 2 runs)

1. **~30% accuracy** for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
2. **Run-to-run variation is ~6-8pp** — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
3. **Base model parse failure loop** — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
4. **GRPO solves format compliance** — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid tool-call JSON.
5. **GRPO failure mode is SQL quality, not format** — episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
6. **Extra turns don't help base models** — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
7. **1-shot occasionally gets lucky** — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.
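The parse-failure fallback noted above (substitute a no-op DESCRIBE instead of ending the episode) works roughly like this. This is a sketch, not the eval harness's actual code: it assumes Qwen3-style `<tool_call>...</tool_call>` wrapping, and `parse_tool_call`, `to_action`, and the `NOOP` dict are hypothetical names.

```python
import json
import re

# No-op action: burns a step and returns nothing useful, so base models get
# the same 15-step budget as trained models instead of an early termination.
NOOP = {"name": "describe", "arguments": {"table": ""}}


def parse_tool_call(raw: str):
    """Extract a {"name": ..., "arguments": ...} dict from model output.

    Returns None when the output contains no valid tool-call JSON
    (e.g. v2's bare "Task complete." abstention)."""
    m = re.search(r"<tool_call>\s*(\{.*\})\s*</tool_call>", raw, re.DOTALL)
    if not m:
        return None
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None
    return call if "name" in call else None


def to_action(raw_output: str):
    """Map raw model text to an env action; never terminates the episode."""
    call = parse_tool_call(raw_output)
    if call is None:
        return NOOP, False  # parse failure: logged, episode continues
    return call, True
```

This is what produces the "base models burn 15 steps on repeated parse failures" pattern: every failed parse still consumes a step.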
### v2 vs v1: similar accuracy, more parse failures — behavioral shift

Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:

- **v1 guesses more**: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
- **v2 gives up on hard questions**: v2 produces "Task complete." output after multiple failed queries instead of calling `answer()`, producing parse failures. v2 learned that some questions are unsolvable.
- **Neither is clearly better**: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance.

For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.

The v2 parse failure pattern (from raw output):

```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE Population < ...
[PARSE FAIL] raw: Task complete.   ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete.   ← repeats until budget
```

Compare v1 on the same type of question:

```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE ...
[OK] ANSWER: European cities and their names are: 42   ← wrong, but at least calls answer()
```

This is a form of **calibrated uncertainty** — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.

### For the blog narrative

The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps).
Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.

## Format Mismatch Discovery (F011)

### The three differences between eval and training

1. **role:tool vs role:user** — Qwen3 renders `role:"tool"` as `<|im_start|>user\n...`, `role:"user"` as `<|im_start|>user\nplain text`. Same role token, different content structure.
2. **Structured tool_calls vs raw text** — Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`, eval was using `{"role":"assistant", "content":"..."}`.
3. **No separator vs `\n\n`** — TRL appends the `reset()` return directly to the user message. Eval had `question\n\ntable_hint`.

### Impact

Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint). After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.

### Lesson

Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.
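Concretely, the fix means eval has to construct messages exactly the way TRL builds them during training. A sketch of the matched shapes, mirroring the three differences listed in the format-mismatch section (all field values are illustrative, and the `reset_text` string is a hypothetical example):

```python
import json

# Difference 2: assistant turn carries structured tool_calls,
# not a raw tool-call string in `content`.
assistant_turn = {
    "role": "assistant",
    "tool_calls": [{
        "function": {
            "name": "describe",
            "arguments": json.dumps({"table": "employee"}),  # JSON string, not a dict
        },
    }],
}

# Difference 1: the tool result goes back under role "tool". Qwen3's template
# renders it inside an <|im_start|>user block, but with tool-specific structure
# rather than plain user text.
tool_turn = {"role": "tool", "content": "Employee_ID, Name, Age, City"}

# Difference 3: the env's reset() text is appended directly to the question,
# with no "\n\n" separator.
question = "Which employee received the biggest bonus?"
reset_text = "Tables: employee, evaluation"  # illustrative reset() output
user_turn = {"role": "user", "content": question + reset_text}
```

Whether each of these matters can be checked mechanically by rendering both variants through the tokenizer's chat template and diffing the token IDs.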
## Multi-Turn SFT — Why It's Critical

### Per-turn SFT (broken)

- 347 examples, each one assistant turn
- ~50% were describe calls
- Model learned: "when asked a question, call describe"
- With KL penalty, model stayed anchored to this single-action policy
- Result: reward=0.00, all rollouts identical, advantage=0

### Multi-turn SFT (working)

- 120 examples, each a full describe→query→answer trajectory
- `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags)
- Model learned: the SEQUENCE describe→query→answer
- With KL penalty, model explores within the multi-turn strategy
- Result: reward climbs to 0.7 in Phase 1

### Template patch detail

Qwen3's chat template lacks the `{% generation %}` tags needed by TRL for `assistant_only_loss`. We patch the template before SFT and restore the original before GRPO (TRL does exact-match checks on the template string in `add_response_schema()` and `get_training_chat_template()`).

## The 0.6B Capacity Ceiling

### What works at 0.6B

- Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
- Simple JOINs between 2 tables when the FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
- WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
- Answer formatting: comma lists, pipe-delimited rows, `[]` for empty
- Error recovery: describe after SQL error, retry with correct column names
- `sample` tool usage (learned in Run 6, inconsistent later)

### What breaks at 0.6B

- FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
- Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL
- Ambiguous column resolution: joins with the same column name in both tables
- Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
- "Stadium without concert" pattern: NOT IN with JOIN to get names
- Aggregate + GROUP BY + HAVING chains on multi-table joins

### The hallucination pattern

The model describes a table and sees the exact column names.
Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.

## Thinking Mode Observations (Run 8)

### Benefits

- Reasons through SQL errors: "no such column: airport_code" → think block → tries `AirportCode`
- Empty think blocks on easy questions — token-efficient, emergent behavior
- Multi-step join planning in think blocks

### New failure mode

~23% of rollouts: `assistantassistant...` repeating until the token limit. The model fails to close its think block and burns the entire token budget with garbage.

### OOM risk

Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.

## Narrative Threads for Blog

### "The environment is the product"

From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.

### "Small model showing improvement proves more than large model with marginal gains"

A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.

### "Analysts don't write perfect queries from scratch"

The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.

### "Dense rewards need theory"

Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.
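To make the shaping concrete: the L2 layer is a pure potential difference over binned progress, F = Φ(s') − Φ(s), which is what makes it policy-invariant in the Ng et al. sense. A minimal sketch composing all three layers from the constants in "Reward Architecture Details" — the function and field names here are hypothetical, not the actual `server/reward.py` code:

```python
from dataclasses import dataclass

BINS = [0.0, 0.25, 0.5, 0.75, 1.0]
CLIP_MIN, CLIP_MAX = -0.10, 0.15


def binned(progress: float) -> float:
    """Snap a raw progress estimate in [0, 1] down to the nearest bin."""
    return max(b for b in BINS if b <= progress)


@dataclass
class Step:
    action: str            # "describe" | "sample" | "query" | "answer"
    exec_ok: bool          # executed without error
    is_new: bool           # SQL hash unseen this episode
    progress: float = 0.0  # raw progress estimate (QUERY only)
    prev_progress: float = 0.0
    correct: bool = False  # ANSWER only


def step_reward(s: Step) -> float:
    # L1 operational: small always-on signals.
    r = -0.02                          # step cost (every action)
    if s.exec_ok:
        r += 0.02                      # successful execution
    r += 0.01 if s.is_new else -0.03   # new info vs repeat penalty

    # L2 progress: potential-based delta over binned progress (QUERY only).
    if s.action == "query":
        r += (binned(s.progress) - binned(s.prev_progress)) * 0.15

    # Shaping is clipped per step; the terminal reward sits outside the clip,
    # so a correct answer (1.0) always dominates exploration (~0.3 max).
    r = max(CLIP_MIN, min(CLIP_MAX, r))
    if s.action == "answer":
        r += 1.0 if s.correct else 0.0
    return r
```

Because L2 is a difference of potentials, revisiting the same progress level nets zero over the episode — there is nothing to farm, which is exactly the guarantee this thread argues for.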
### "Multi-turn SFT teaches strategy, not actions"

The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.

## References for Blog

- Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
- DeepSeek-AI (2025). DeepSeek-R1.
- Shao et al. (2024). DeepSeekMath: GRPO.
- Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
- Yu et al. (2018). Spider dataset.
- Li et al. (2023). BIRD benchmark.
- TIPS (2026). Turn-Level Information-Potential Reward Shaping.
- ToolRL (2025). Reward is All Tool Learning Needs.
- StepTool (2024). Step-grained RL for Tool Learning.

## Showcase Notebook Transcripts (for blog)

### Random agent episode (seed=7) — comedic failure

Question: "Count the number of paragraphs."

```
SAMPLE Paragraphs → reward=0.015
SAMPLE Documents → reward=0.015
DESCRIBE Documents → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Templates → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
QUERY SELECT * FROM "Templates" LIMIT 5 → reward=0.0625
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Paragraphs → reward=0.015
QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
QUERY SELECT * FROM "Documents" LIMIT 5 → reward=0.025
ANSWER 76 | 20 | Robbin CV | y | None → reward=0.000 (incorrect)
```

Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote `SELECT COUNT(*)`.

### Oracle agent episode (seed=0) — clean solve

Question: "List the id of students who registered some courses and the number of their registered courses?"
```
Step 1: DESCRIBE student_course_registrations
        → student_id INTEGER, course_id INTEGER, registration_date DATETIME
        → reward: +0.015
Step 2: DESCRIBE students
        → student_id INTEGER, student_details VARCHAR(255)
        → reward: +0.015
Step 3: QUERY SELECT T1.student_id, count(*) FROM students AS T1
        JOIN student_course_registrations AS T2 ON T1.student_id = T2.student_id
        GROUP BY T1.student_id
        → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
        → reward: +0.150
Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
        → correct → reward: +1.000
```

Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.

### Baseline comparison (50 episodes each)

| Policy | Success Rate | Avg Reward | Avg Steps |
|--------|-------------|------------|-----------|
| Random | 0.0% | 0.247 | 15.0 |
| Oracle | 100.0% | 1.168 | 3.5 |

The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.

### Reward constants (from server/reward.py)

```
+0.02  successful execution (no errors)
+0.01  new information (unique query)
-0.02  step cost (every action)
-0.03  repeat penalty (duplicate SQL)
[-0.10, +0.15]  per-step clipping range
+1.0   correct answer (terminal)
+0.0   wrong answer (terminal)
```

Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.

## Competition Context

### OpenEnv Challenge (our target)

- Sponsors: PyTorch/Meta, HuggingFace, Unsloth
- Prize: $10K HF credits
- Judging: primarily blog-based
- Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper
- Green Agent wrapper is an explicit judging criterion in the OpenEnv Challenge.

### Deliverables

1. Environment on HF Hub — **live** at https://huggingface.co/spaces/hjerpe/sql_env (pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`)
2. Training notebooks/scripts on GitHub — `notebooks/train_grpo.ipynb`, `notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb`
3. Blog on HuggingFace — `docs/blog-post-v1.md` (draft)

### TRL integration status (already done — do not re-research)

`training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory` class: `reset()` + named tool methods `describe() / sample() / query() / answer()` with docstrings TRL uses to build the tool schema. The notebook passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL, reward_funcs=[sql_env_reward_func])`. The adapter runs `SQLEnvironment` **in-process** (not a WebSocket client to the HF Space) — intentional, because training opens N parallel sessions and the Space defaults to 1.

### Competitive landscape

- **SQL Repair** (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
- **Calendar Gym** (Turing) — featured on HF blog, real-world framing + failure analysis
- **OpenSec** — cybersecurity with arXiv paper, adversarial evidence injection
- Our position: no interactive SQL *exploration* environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.

### What winning entries do

1. Stakes framing — "this matters in production"
2. Concrete failure analysis with numbers
3. Contrast (random vs trained vs oracle)
4. Real data, not toy puzzles
5. Non-obvious insights from training

## Green Agent Evaluator

### What it is

OpenEnv's standardized evaluation wrapper pattern. A `Policy` protocol with `evaluate(env, policy, n_episodes, seed)` that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.
### Implementation

- `evaluation/policies.py` — `Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
- `evaluation/oracle_policy.py` — `OraclePolicy` baseline (runs gold SQL)
- `tests/test_evaluation.py` — 17 tests, all passing (unit + integration)

### How it works

```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

# Run 50 episodes with random policy
result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")

# Run with trained policy (any class with select_action method)
result = evaluate(env, trained_policy, n_episodes=50, seed=42)
```

### Where it's used

- `notebooks/showcase_sqlenv.ipynb` — Random vs Oracle baseline comparison
- `notebooks/compare_methods.ipynb` — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`

### Key design choices

- **Error isolation**: one episode crashing doesn't kill the run — logged as `EpisodeResult(error=str(exc))`
- **Deterministic seeding**: `seed + episode_index` per episode for reproducibility
- **Protocol-based**: any class with `select_action(observation) -> action` works — no inheritance required
- **Aggregate + per-episode**: `EvaluationResult` has both summary metrics and a full `episodes` list for drill-down

### For the blog

The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through `evaluate()`. The trained GRPO model is wrapped in `LLMToolCallingPolicy` (which implements the `Policy` protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.
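The design choices above can be made concrete. This is a sketch of the protocol/harness shape reconstructed from the description — not the actual `evaluation/policies.py` code — and the `env.reset(seed=...)` / `env.step(...)` return shape is a hypothetical stand-in:

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Protocol


class Policy(Protocol):
    """Structural typing: anything with select_action works, no inheritance."""
    def select_action(self, observation: Any) -> Any: ...


@dataclass
class EpisodeResult:
    reward: float = 0.0
    steps: int = 0
    success: bool = False
    error: Optional[str] = None  # set when the episode crashed


@dataclass
class EvaluationResult:
    episodes: List[EpisodeResult]  # full per-episode list for drill-down

    @property
    def success_rate(self) -> float:
        return sum(e.success for e in self.episodes) / len(self.episodes)

    @property
    def avg_reward(self) -> float:
        return sum(e.reward for e in self.episodes) / len(self.episodes)


def evaluate(env, policy: Policy, n_episodes: int, seed: int) -> EvaluationResult:
    episodes = []
    for i in range(n_episodes):
        try:
            obs = env.reset(seed=seed + i)  # deterministic per-episode seed
            total, steps, done, correct = 0.0, 0, False, False
            while not done and steps < 15:  # hard step budget
                obs, reward, done, correct = env.step(policy.select_action(obs))
                total += reward
                steps += 1
            episodes.append(EpisodeResult(reward=total, steps=steps, success=correct))
        except Exception as exc:
            # Error isolation: one crashing episode is logged, not fatal.
            episodes.append(EpisodeResult(error=str(exc)))
    return EvaluationResult(episodes)
```

The Protocol-based design is what lets `RandomPolicy`, `OraclePolicy`, and `LLMToolCallingPolicy` all flow through the identical harness.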
## Files to Reference

| File | Relevance |
|------|-----------|
| `docs/blog-outline.md` | Section structure template |
| `docs/blog-post.md` | Current draft |
| `docs/design-docs/reward-shaping-research.md` | Reward theory + references |
| `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed |
| `vision/VISION.md` | Product vision, success metrics |
| `training/trl_adapter.py` | Environment adapter code |
| `notebooks/compare_methods.ipynb` | Eval notebook |
| `notebooks/train_grpo.ipynb` | Training notebook |