---
title: SQLEnv
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: true
base_path: /web
---

# SQLEnv: Teaching Small Models to Explore Databases

![Python](https://img.shields.io/badge/python-3.12-blue.svg)
![License](https://img.shields.io/badge/license-MIT-green.svg)
![Data](https://img.shields.io/badge/data-CC%20BY--SA%204.0-orange.svg)

SQLEnv is an RL environment for training small language models to answer questions about SQL databases through iterative exploration. Instead of producing one-shot SQL from a fully visible schema, the agent discovers the schema step by step using four tools: DESCRIBE, SAMPLE, QUERY, and ANSWER. Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and trained with [TRL](https://huggingface.co/docs/trl)'s GRPO implementation.

A 0.6B-parameter model trained in this environment goes from 0% to ~30% accuracy on a curated Spider subset, learning to explore schemas, recover from SQL errors, and format answers correctly.

**[Blog post](https://hjerpe-sqlenv-blog.static.hf.space)** | **[Live environment](https://huggingface.co/spaces/hjerpe/sql_env)** | **[Training notebook](notebooks/train_grpo.ipynb)**

## Quick Start

```bash
uv sync
uv run pytest tests/ -v
```

Run the environment locally:

```bash
uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sqlenv:latest -f server/Dockerfile .
docker run -p 8000:8000 sqlenv:latest
```

## How It Works

Each episode starts with a natural-language question and a list of table names. The schema (columns, types, relationships) is hidden.
The agent uses four actions to explore:

| Action | Purpose |
|--------|---------|
| `DESCRIBE table` | Reveal column names, types, and row count |
| `SAMPLE table` | Preview representative rows |
| `QUERY sql` | Execute read-only SQL |
| `ANSWER value` | Submit a final answer (ends episode) |

The environment provides a dense reward at each step (operational feedback plus progress toward the answer) and a terminal reward for correctness (+1.0 correct, 0.0 wrong). See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for details on the reward architecture.

```python
from server.sql_environment import SQLEnvironment, SQLAction

env = SQLEnvironment(
    questions_path="data/questions/questions_train.json",
    db_dir="data/databases",
    tokenizer=tok,
)

obs = env.reset(seed=42)
obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
obs = env.step(SQLAction(action_type="ANSWER", argument="10"))  # obs.done=True, obs.reward=1.0
```

## Training

We train [Qwen3-0.6B](https://arxiv.org/abs/2505.09388) using [GRPO](https://arxiv.org/abs/2402.03300) (from DeepSeekMath) through TRL's `environment_factory`. The full pipeline (SFT warmup + two-phase GRPO) runs in ~5 hours on a single Colab L4.

**Notebooks:**

- **[train_grpo.ipynb](notebooks/train_grpo.ipynb)** runs the full SFT + GRPO pipeline
- **[compare_methods.ipynb](notebooks/compare_methods.ipynb)** evaluates base vs. trained models
- **[showcase_sqlenv.ipynb](notebooks/showcase_sqlenv.ipynb)** lets you explore the environment interactively

**Local test (CPU, ~3 min):**

```bash
docker build -f Dockerfile.test -t sqlenv-test .
docker run --rm sqlenv-test
```

## Evaluation

All evaluation runs through the Green Agent evaluator:

```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

result = evaluate(env, policy, n_episodes=50, seed=0)
print(f"Accuracy: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
```

Results on our curated 10-database Spider subset (N=50, 2 runs):

| Method | Accuracy | Parse Rate | Avg Steps |
|--------|----------|------------|-----------|
| Zero-shot | 0% | 24-28% | 10.8-12.4 |
| 1-shot | 0-2% | 16-17% | 14.0-14.8 |
| 3-shot | 0% | 19-20% | 13.8-14.8 |
| GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
| GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |

This evaluation is not comparable to the official Spider leaderboard, which uses different scoring, full-schema input, and a broader database set. See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for detailed analysis.

## Data

676 questions (473 train, 203 eval) across 10 Spider databases with difficulty labels, plus 120 multi-turn SFT warmup trajectories generated from gold SQL. See [docs/data-sources.md](docs/data-sources.md) for full details on provenance, curation, and regeneration.

Data in `data/` is adapted from [Spider](https://yale-lily.github.io/spider) (Yu et al., 2018) and shared under CC BY-SA 4.0. See [DATA_LICENSE](DATA_LICENSE).
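The databases under `data/databases/` are plain SQLite files, so you can poke at them directly with Python's standard-library `sqlite3` module before (or instead of) going through the environment. A minimal sketch; the table and file names here are illustrative, not actual dataset contents:

```python
# Minimal sketch: inspect a SQLite database with the standard library alone.
# To look at a real one, point sqlite3.connect at any file under
# data/databases/ (the in-memory demo below just keeps the sketch runnable).
import sqlite3

def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all user tables in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

def describe_table(conn: sqlite3.Connection, table: str) -> list[tuple[str, str]]:
    """Return (column, type) pairs, roughly what the DESCRIBE action reveals."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]

# Demo on an in-memory database with a made-up schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
print(list_tables(conn))                  # ['employee']
print(describe_table(conn, "employee"))   # [('id', 'INTEGER'), ('name', 'TEXT')]
```

This is handy for sanity-checking questions against their database, or for eyeballing what the agent will have to discover through DESCRIBE and SAMPLE.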
## Project Structure

```
sqlenv/
├── __init__.py, client.py, models.py   # Core types and client
├── server/
│   ├── app.py                # FastAPI server
│   ├── sql_environment.py    # Environment implementation
│   ├── reward.py             # Three-layer reward function
│   ├── verifier.py           # Answer verification
│   └── Dockerfile            # HF Spaces deployment
├── evaluation/               # Green Agent evaluator, policies
├── training/                 # TRL adapter, data loading
├── scripts/                  # Data curation, SFT generation
├── notebooks/                # Training, evaluation, showcase
├── data/
│   ├── databases/            # 10 Spider SQLite databases
│   ├── questions/            # Train/eval question sets
│   └── sft/                  # SFT warmup trajectories
├── configs/                  # Training configurations
├── tests/                    # Unit and integration tests
└── docs/
    ├── data-sources.md       # Data provenance
    └── ARCHITECTURE.md       # System architecture
```

## References

- Yu et al. (2018). [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task](https://yale-lily.github.io/spider). EMNLP.
- Shao et al. (2024). [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). (GRPO algorithm)
- Ng, Harada, Russell (1999). [Policy Invariance Under Reward Transformations](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf). ICML.
- [OpenEnv framework](https://github.com/meta-pytorch/OpenEnv)
- [TRL OpenEnv docs](https://huggingface.co/docs/trl/openenv)

## License

Code: [MIT](LICENSE). Data: [CC BY-SA 4.0](DATA_LICENSE).