# SQLEnv Blog Post Outline

## 1) Cold Open: Two Agents, Same Question

Open with two transcripts side by side — no explanation yet, just show the contrast.

**Random agent** (from showcase notebook, seed=7):

- "Count the number of paragraphs."
- SAMPLEs the same table 4 times, DESCRIBEs Documents 5 times, runs 3 `SELECT *` queries, then submits a random row as the answer.
- Reward: 0.278, incorrect.

**Trained agent** (from blog-material, error recovery example):

- "Which employee received the biggest bonus?"
- Describes `employee`, tries the wrong column (Salary), gets an error, describes `evaluation` to find the Bonus column, writes the correct JOIN, answers "Louis Deacon".
- Reward: 1.13, correct.

One explored strategically. The other wandered. Both had the same tools, the same budget, the same database. The difference is learned behavior.

## 2) The Gap (3 sentences, not a section)

Text-to-SQL benchmarks give the model the full schema and ask for one query. That tests memorization, not investigation. SQLEnv hides the schema and gives the agent a step budget — forcing it to develop the exploration strategy that makes human analysts reliable.

## 3) Four Actions, One Budget

Introduce the action space through the oracle episode (showcase notebook, seed=0):

- Question: "List the id of students who registered some courses and the number of their registered courses?"
- Step 1: DESCRIBE student_course_registrations → sees columns (+0.015)
- Step 2: DESCRIBE students → sees student_id (+0.015)
- Step 3: QUERY with JOIN + GROUP BY → gets the answer (+0.150)
- Step 4: ANSWER → correct (+1.000)
- Total: 1.180 in 4 steps.

Then show the reward architecture table:

| Tier | Signal | Value |
|------|--------|-------|
| L1 Operational | execution / new info / repeats / step cost | +0.02 / +0.01 / -0.01 / -0.005 |
| L2 Progress | delta from previous query result (potential-based) | varies |
| L3 Terminal | correct / wrong answer | +1.0 / 0.0 |

Key point: the terminal reward dominates. Max exploration reward over 15 steps is ~0.3; a correct answer is 1.0. No farming.
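The three tiers can be condensed into a small sketch. The function and flag names here are hypothetical and the constants are the ones from the table; the environment's actual implementation may split these differently, but the structure is the point: operational nudges, a potential-based progress delta, and a terminal reward that dwarfs both.

```python
def step_reward(executed: bool, new_info: bool, repeated: bool,
                phi_prev: float, phi_curr: float,
                answered: bool = False, correct: bool = False) -> float:
    """Illustrative three-tier reward (names hypothetical, values from the table)."""
    r = -0.005                       # L1: per-step cost
    if executed:
        r += 0.02                    # L1: query/describe executed successfully
    if new_info:
        r += 0.01                    # L1: observation revealed something new
    if repeated:
        r -= 0.01                    # L1: penalty for repeating an action
    r += phi_curr - phi_prev         # L2: potential-based progress delta
    if answered and correct:
        r += 1.0                     # L3: terminal reward dominates everything
    return r
```

Even a maximally efficient exploration step earns ~0.025 here, which is why 15 steps of pure exploration cap out near 0.3 while a correct answer is worth 1.0.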
## 4) Training: SFT Teaches Strategy, GRPO Refines It

NOT "here's how GRPO works." Lead with the insight:

- Per-turn SFT (347 examples) taught the model to call describe forever. It never learned when to query or answer.
- Multi-turn SFT (120 full trajectories with assistant_only_loss) taught describe-then-query-then-answer as a coherent strategy.
- GRPO then refined this into real behaviors: error recovery, answer formatting, knowing when to stop.

Two-phase curriculum:

- Phase 1: easy questions with a KL penalty — stabilize format
- Phase 2: easy + medium without KL — allow exploration

Show the reward curve: -0.1 to 0.5–0.7 over 400 steps. A clear learning signal.

## 5) What the Agent Learned

Lead with observed behaviors, not metrics:

- **Schema discovery**: always describes before querying
- **Error recovery**: wrong column → re-describe → correct retry (concrete example)
- **Answer formatting**: comma-separated lists, pipe-delimited rows, `[]` for empty results
- **Subquery composition**: NOT IN, GROUP BY HAVING, UNION queries

These emerged from the reward signal, not hard-coded rules.

Comparison table (N=50 eval, 2026-04-11):

| Method | Accuracy | Parse Rate | Avg Steps |
|--------|----------|------------|-----------|
| Zero-shot | 0% | 28% | 10.8 |
| 1-shot | 0% | 16% | 14.8 |
| 3-shot | 0% | 20% | 13.8 |
| GRPO v1 | 28% | 95% | 4.0 |
| GRPO v2 | 32% | 87% | 3.7 |

Two things stand out. First, the 95% parse rate — the trained model almost always produces valid tool-call JSON, while the base model fails 72–84% of the time and wastes its entire step budget repeating the same malformed output. Second, 28–32% accuracy from a 0% baseline — the environment produces genuine learning. The base model can't get a single answer right even with 3 examples; the trained model solves 14–16 of 50 questions in just 3–4 steps.

## 6) What the Agent Can't Do (The Interesting Part)

This is where small models hit a wall — and the wall tells us something about the environment.
- **Column name hallucination**: reads `FullName` from DESCRIBE, writes `full_name` in SQL. Pretraining biases override the in-context schema. A 0.6B model can't fight its own weights.
- **FK chain reasoning**: single-table queries work; 3+ table JOINs don't. The model can't chain Documents → Templates → Ref_Template_Types.
- **More RL doesn't help**: v2 (double the training steps) produced identical accuracy. The ceiling is pretraining knowledge, not training budget.

This is actually the point: the environment produces a clear learning signal that saturates at the model's capacity. A larger model (or better SFT on JOIN patterns) would push higher. The environment scales; the 0.6B model doesn't.

## 7) Reward Theory (Brief — For Technical Judges)

One paragraph: potential-based shaping (Ng et al., 1999). Our delta progress rewards take the form F(s, s') = φ(s') − φ(s), which provably preserves the optimal policy. Without this guarantee, agents learn to farm exploration rewards instead of answering questions. We observed this directly when we tried cumulative progress caps (not potential-based): the agent explored endlessly.

## 8) Technical Highlights (Bullet List)

- 10 Spider databases with structured metadata and a deterministic train/eval split
- Typed action and observation models (Pydantic) — every interaction is explicit
- Read-only SQL via SQLite `mode=ro` — safety from the database engine, not regex
- TRL environment_factory integration — plugs into standard GRPO training
- Docker packaging for HuggingFace Spaces with health checks
- 473 training / 50 eval questions across easy/medium difficulty

## 9) Try It Yourself

- HuggingFace Space: [live demo]
- Training notebook: notebooks/train_grpo.ipynb — runs on a Colab L4 in ~7 hours
- Showcase notebook: notebooks/showcase_sqlenv.ipynb — explore the environment interactively
- GitHub: full source, architecture docs

## 10) What We Learned (Close with Insights)

Three non-obvious findings:

1. **Multi-turn SFT teaches strategy, not actions.** Per-turn examples teach vocabulary; multi-turn examples teach conversation. The difference is between a model that calls describe forever and one that knows when to answer.

2. **Transparent errors produce better policies.** When the environment surfaces "Error: no such column: full_name" instead of empty results, the agent develops error-recovery strategies. Better diagnostics produce better gradient signal.

3. **Dense rewards need theory.** Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we observed agents farming exploration rewards at the expense of answering questions.
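The shaping guarantee behind finding 3 can be shown in a few lines. This is a minimal illustration of the Ng et al. (1999) result, not the environment's code: the potential values and state names are hypothetical (chosen as exact binary fractions so the arithmetic is exact). Because F(s, s') = φ(s') − φ(s) telescopes, the total shaping bonus along any trajectory depends only on the start and end states, so wandering earns nothing extra.

```python
def shaping_bonus(trajectory, phi):
    """Sum of F(s, s') = phi(s') - phi(s) along a state trajectory."""
    return sum(phi[b] - phi[a] for a, b in zip(trajectory, trajectory[1:]))

# Hypothetical potentials over abstract agent states.
phi = {"start": 0.0, "describe": 0.125, "query": 0.25, "answer": 0.5}

direct = ["start", "describe", "query", "answer"]
wander = ["start", "describe", "query", "describe", "query", "answer"]

# Both paths earn exactly phi("answer") - phi("start"): the detour is free,
# so there is no exploration-farming incentive.
assert shaping_bonus(direct, phi) == shaping_bonus(wander, phi)
```

A non-potential-based bonus (e.g. a flat reward per novel describe) breaks this telescoping, which is exactly the farming behavior observed with the cumulative progress caps.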