---
title: SQLEnv
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: true
base_path: /web
---

# SQLEnv: Teaching Small Models to Explore Databases

![Python](https://img.shields.io/badge/python-3.12-blue.svg)
![License](https://img.shields.io/badge/license-MIT-green.svg)
![Data](https://img.shields.io/badge/data-CC%20BY--SA%204.0-orange.svg)

SQLEnv is an RL environment for training small language models to answer questions about SQL databases through iterative exploration. Instead of producing one-shot SQL from a fully visible schema, the agent discovers the schema step by step using four tools: DESCRIBE, SAMPLE, QUERY, and ANSWER. Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and trained with [TRL](https://huggingface.co/docs/trl)'s GRPO implementation.

A 0.6B-parameter model trained in this environment goes from 0% to ~30% accuracy on a curated Spider subset, learning to explore schemas, recover from SQL errors, and format answers correctly.

**[Blog post](https://hjerpe-sqlenv-blog.static.hf.space)** | **[Live environment](https://huggingface.co/spaces/hjerpe/sql_env)** | **[Training notebook](notebooks/train_grpo.ipynb)**

## Quick Start

```bash
uv sync
uv run pytest tests/ -v
```

Run the environment locally:

```bash
uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sqlenv:latest -f server/Dockerfile .
docker run -p 8000:8000 sqlenv:latest
```

## How It Works

Each episode starts with a natural-language question and a list of table names. The schema (columns, types, relationships) is hidden.
The agent uses four actions to explore:

| Action | Purpose |
|--------|---------|
| `DESCRIBE table` | Reveal column names, types, and row count |
| `SAMPLE table` | Preview representative rows |
| `QUERY sql` | Execute read-only SQL |
| `ANSWER value` | Submit a final answer (ends episode) |

The environment provides a dense reward at each step (operational feedback plus progress toward the answer) and a terminal reward for correctness (+1.0 correct, 0.0 wrong). See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for details on the reward architecture.

```python
from server.sql_environment import SQLEnvironment, SQLAction

env = SQLEnvironment(
    questions_path="data/questions/questions_train.json",
    db_dir="data/databases",
    tokenizer=tok,
)

obs = env.reset(seed=42)
obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
obs = env.step(SQLAction(action_type="ANSWER", argument="10"))  # obs.done=True, obs.reward=1.0
```

## Training

We train [Qwen3-0.6B](https://arxiv.org/abs/2505.09388) using [GRPO](https://arxiv.org/abs/2402.03300) (from DeepSeekMath) through TRL's `environment_factory`. The full pipeline (SFT warmup + two-phase GRPO) runs in ~5 hours on a single Colab L4.

**Notebooks:**

- **[train_grpo.ipynb](notebooks/train_grpo.ipynb)** runs the full SFT + GRPO pipeline
- **[compare_methods.ipynb](notebooks/compare_methods.ipynb)** evaluates base vs. trained models
- **[showcase_sqlenv.ipynb](notebooks/showcase_sqlenv.ipynb)** lets you explore the environment interactively

**Local test (CPU, ~3 min):**

```bash
docker build -f Dockerfile.test -t sqlenv-test .
docker run --rm sqlenv-test
```

## Evaluation

All evaluation runs through the Green Agent evaluator:

```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

result = evaluate(env, policy, n_episodes=50, seed=0)
print(f"Accuracy: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
```

Results on our curated 10-database Spider subset (N=50, 2 runs):

| Method | Accuracy | Parse Rate | Avg Steps |
|--------|----------|------------|-----------|
| Zero-shot | 0% | 24-28% | 10.8-12.4 |
| 1-shot | 0-2% | 16-17% | 14.0-14.8 |
| 3-shot | 0% | 19-20% | 13.8-14.8 |
| GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
| GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |

This evaluation is not comparable to the official Spider leaderboard, which uses different scoring, full-schema input, and a broader database set. See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for detailed analysis.

## Data

676 questions (473 train, 203 eval) across 10 Spider databases with difficulty labels, plus 120 multi-turn SFT warmup trajectories generated from gold SQL. See [docs/data-sources.md](docs/data-sources.md) for full details on provenance, curation, and regeneration.

Data in `data/` is adapted from [Spider](https://yale-lily.github.io/spider) (Yu et al., 2018) and shared under CC BY-SA 4.0. See [DATA_LICENSE](DATA_LICENSE).
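The databases under `data/databases/` are plain SQLite files, so you can poke at them directly with Python's standard-library `sqlite3` module before (or instead of) going through the environment. A minimal sketch; the table and file names here are illustrative, not actual dataset contents:

```python
# Minimal sketch: inspect a SQLite database with the standard library alone.
# To look at a real one, point sqlite3.connect at any file under
# data/databases/ (the in-memory demo below just keeps the sketch runnable).
import sqlite3

def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all user tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

def describe_table(conn: sqlite3.Connection, table: str) -> list[tuple[str, str]]:
    """Return (column, type) pairs, roughly what the DESCRIBE action reveals."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]

# Demo on an in-memory database with a made-up schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
print(list_tables(conn))                  # ['employee']
print(describe_table(conn, "employee"))   # [('id', 'INTEGER'), ('name', 'TEXT')]
```

This is handy for sanity-checking questions against their database, or for eyeballing what the agent will have to discover through DESCRIBE and SAMPLE.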
## Project Structure

```
sqlenv/
├── __init__.py, client.py, models.py   # Core types and client
├── server/
│   ├── app.py                # FastAPI server
│   ├── sql_environment.py    # Environment implementation
│   ├── reward.py             # Three-layer reward function
│   ├── verifier.py           # Answer verification
│   └── Dockerfile            # HF Spaces deployment
├── evaluation/               # Green Agent evaluator, policies
├── training/                 # TRL adapter, data loading
├── scripts/                  # Data curation, SFT generation
├── notebooks/                # Training, evaluation, showcase
├── data/
│   ├── databases/            # 10 Spider SQLite databases
│   ├── questions/            # Train/eval question sets
│   └── sft/                  # SFT warmup trajectories
├── configs/                  # Training configurations
├── tests/                    # Unit and integration tests
└── docs/
    ├── data-sources.md       # Data provenance
    └── ARCHITECTURE.md       # System architecture
```

## References

- Yu et al. (2018). [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task](https://yale-lily.github.io/spider). EMNLP.
- Shao et al. (2024). [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). (GRPO algorithm)
- Ng, Harada, Russell (1999). [Policy Invariance Under Reward Transformations](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf). ICML.
- [OpenEnv framework](https://github.com/meta-pytorch/OpenEnv)
- [TRL OpenEnv docs](https://huggingface.co/docs/trl/openenv)

## License

Code: [MIT](LICENSE). Data: [CC BY-SA 4.0](DATA_LICENSE).