
SQLEnv Vision

Purpose

SQLEnv is a reinforcement learning environment, not a text-to-SQL model. The environment is the product. It provides reward signals, action space, and episode structure that enable any RL algorithm to train agents that explore databases iteratively, the way human data analysts do.

The Problem

Text-to-SQL benchmarks (Spider, BIRD, WikiSQL) evaluate one-shot SQL generation: given a question and schema, produce the correct query. This misses how data analysis works. Analysts don't write perfect queries from scratch; they explore schemas, run test queries, observe results, refine hypotheses, and iterate toward an answer.

No RL environment captures this multi-turn exploration process. Without one, we cannot train agents that develop investigative reasoning strategies.

What We Build

An OpenEnv-compatible RL environment where:

  • Schema starts hidden. Agents see only table names at reset. They must discover columns, types, relationships, and data through DESCRIBE and SAMPLE actions.
  • Exploration earns rewards. A 3-layer reward architecture provides feedback during exploration, not just at termination. This enables RL convergence.
  • Actions mirror analyst workflows. DESCRIBE (learn schema), SAMPLE (see data), QUERY (test hypotheses), ANSWER (submit result): together these reproduce the investigative loop of a working analyst.
  • Multi-hop reasoning emerges naturally. Spider questions requiring JOINs across 2-5+ tables force agents to discover schemas, identify relationships, and compose queries incrementally.
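The interaction pattern above can be sketched with a toy stand-in. The class name, reward values, and observation format here are illustrative assumptions, not the environment's actual API; an in-memory SQLite table substitutes for a Spider database.

```python
import sqlite3

class ToySQLEnv:
    """Toy stand-in for SQLEnv: schema hidden at reset, revealed by actions."""
    def __init__(self, question, gold_sql, max_steps=10):
        self.db = sqlite3.connect(":memory:")
        self.db.executescript(
            "CREATE TABLE singer(id INTEGER, name TEXT, age INTEGER);"
            "INSERT INTO singer VALUES (1,'Ada',30),(2,'Bo',25);"
        )
        self.question, self.gold_sql, self.max_steps = question, gold_sql, max_steps

    def reset(self):
        self.steps = 0
        # Observation at reset: table names only -- columns stay hidden.
        tables = [r[0] for r in self.db.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        return {"question": self.question, "tables": tables}

    def step(self, action, arg):
        self.steps += 1
        done, reward, obs = False, 0.0, None
        if action == "DESCRIBE":            # learn schema
            obs = self.db.execute(f"PRAGMA table_info({arg})").fetchall()
            reward = 0.05
        elif action == "SAMPLE":            # see data
            obs = self.db.execute(f"SELECT * FROM {arg} LIMIT 3").fetchall()
            reward = 0.05
        elif action == "QUERY":             # test hypotheses
            obs = self.db.execute(arg).fetchall()
            reward = 0.1
        elif action == "ANSWER":            # submit result, episode ends
            gold = self.db.execute(self.gold_sql).fetchall()
            obs = self.db.execute(arg).fetchall()
            reward = 1.0 if obs == gold else -0.5
            done = True
        if self.steps >= self.max_steps:
            done = True
        return obs, reward, done

env = ToySQLEnv("How many singers are there?", "SELECT COUNT(*) FROM singer")
obs = env.reset()                        # sees only {'tables': ['singer'], ...}
env.step("DESCRIBE", "singer")           # discover columns
env.step("SAMPLE", "singer")             # peek at rows
_, r, done = env.step("ANSWER", "SELECT COUNT(*) FROM singer")
```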

What We Do Not Build

  • A text-to-SQL model (we build the environment others train on)
  • A production SQL agent (the trained agent demonstrates the environment, nothing more)
  • A static benchmark (the environment is interactive and stateful)

Why OpenEnv

OpenEnv (Meta/HuggingFace) provides:

  • Gym-oriented API (reset()/step()) familiar to RL practitioners
  • Docker isolation for safe SQL execution
  • HuggingFace Spaces deployment
  • TRL/GRPO integration via environment_factory pattern
  • Standard evaluation harness for reproducible comparisons
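A minimal sketch of the factory pattern behind that TRL hookup (the real adapter is training/trl_adapter.py::SQLEnvTRL; CountdownEnv and the zero rewards below are placeholders). The point is that the trainer receives a zero-argument callable, not a shared instance, so each call yields a fresh environment with the Gym-oriented reset()/step() API.

```python
class CountdownEnv:
    """Placeholder env, NOT the real SQLEnv: just counts down a step budget."""
    def __init__(self, budget=5):
        self.budget = budget

    def reset(self):
        self.remaining = self.budget
        return {"steps_left": self.remaining}

    def step(self, action):
        self.remaining -= 1
        done = self.remaining <= 0
        return {"steps_left": self.remaining}, 0.0, done  # obs, reward, done

def environment_factory():
    # A factory rather than a singleton: parallel rollouts need isolated state.
    return CountdownEnv()

a, b = environment_factory(), environment_factory()
a.reset(); b.reset()
a.step("noop")
# a and b now hold different step counts -- state is not shared across workers.
```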

Design Principles

1. The Environment Owns Rewards, Not Reasoning

The environment executes SQL, computes rewards, and enforces constraints. The agent owns all reasoning: schema discovery strategy, query composition, when to answer. We removed Ollama/LLM inference from the environment deliberately: the environment judges, it does not participate.

2. Terminal Correctness Dominates

Exploration rewards (capped at 0.5) never exceed terminal correctness (+1.0). An agent that explores but never answers scores less than one that answers correctly with minimal exploration. This prevents reward gaming.
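The dominance property can be checked with simple arithmetic. Only the 0.5 cap and the +1.0 terminal reward come from this principle; the per-episode exploration totals below are made up for illustration.

```python
EXPLORATION_CAP = 0.5    # from this principle
TERMINAL_CORRECT = 1.0

def episode_return(exploration_earned, answered_correctly):
    # Exploration shaping is capped; terminal correctness is added on top.
    shaped = min(exploration_earned, EXPLORATION_CAP)
    return shaped + (TERMINAL_CORRECT if answered_correctly else 0.0)

explorer_only = episode_return(2.0, False)   # games exploration, never answers
quick_correct = episode_return(0.1, True)    # minimal exploration, right answer
# quick_correct (1.1) beats explorer_only (0.5): correctness dominates.
```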

3. Partial Observability Is the Point

The POMDP structure (hidden schema, truncated results, step budget) forces strategic exploration. It makes the environment worth training on.
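One way to picture the resulting observation. All field names and limits here are illustrative assumptions, not the environment's actual observation space; the point is what is present versus what is deliberately hidden.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """What the agent sees: only what it has earned through exploration."""
    question: str
    table_names: list                                   # revealed at reset
    known_columns: dict = field(default_factory=dict)   # grows via DESCRIBE
    last_rows: list = field(default_factory=list)       # query results, truncated
    steps_remaining: int = 15                           # the step budget
    # Deliberately absent: full schema, foreign keys, gold SQL -- hidden state.

obs = Observation(question="How many singers?", table_names=["singer"])
```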

4. Dense Rewards Enable Training

The 3-layer reward architecture (operational + progress + terminal) exists to make RL training feasible on small models. Without dense rewards, a <0.5B parameter model cannot learn from sparse terminal-only signals.
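A per-step sketch of how three such layers might combine. The layer names come from this principle; every numeric weight is an invented placeholder.

```python
def step_reward(executed_ok, learned_something_new, is_terminal, correct):
    r = 0.0
    r += 0.02 if executed_ok else -0.05           # layer 1: operational
    r += 0.10 if learned_something_new else 0.0   # layer 2: progress
    if is_terminal:
        r += 1.0 if correct else -0.5             # layer 3: terminal correctness
    return r

# Dense feedback: even a non-terminal DESCRIBE that reveals a new table
# returns a positive signal instead of zero, which is what lets a small
# model learn before it ever produces a fully correct answer.
mid_episode = step_reward(True, True, False, False)
final_right = step_reward(True, False, True, True)
```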

5. Small Models Over Large

Target model size: <0.5B parameters (e.g., Qwen3-0.6B). The goal is demonstrating that the environment produces learning signals, not achieving SOTA accuracy. A small model showing clear improvement over random proves more than a large model with marginal gains.

Competition Context

OpenEnv Challenge: create a real-world RL environment and demonstrate training on it.

Evaluation criteria:

  1. Creative and robust use of OpenEnv
  2. Technical excellence
  3. Storytelling
  4. Open-source demo
  5. Green Agent wrapper

Our position: No interactive SQL exploration environment exists. The 3-layer reward architecture, multi-hop Spider questions, and POMDP framing are novel. The "analyst exploration" narrative is relatable.

Success Metrics

Environment Quality (primary)

  • Stable over 1000+ episodes without crashes
  • Reward differentiates policies: random exploration ~0.1, targeted exploration ~0.3, correct answer ~1.3
  • Oracle policy achieves ~100% success rate (validates reward ceiling)
  • Anti-gaming measures prevent degenerate strategies

Training Signal (secondary; demonstrates environment quality)

  • Trained model beats random baseline on success rate
  • Learning curve shows improvement over episodes
  • Episode transcripts show strategic vs random exploration

Submission Completeness

  • HuggingFace Space running
  • Training notebook reproducible
  • Blog post tells the story
  • Green Agent evaluation wrapper functional

Roadmap (Done / Now / Later)

Done

  • GRPO training pipeline (F006): ~30% eval accuracy vs 0% base
  • Oracle policy + Green Agent evaluation wrapper (evaluation/)
  • TRL environment_factory adapter (training/trl_adapter.py::SQLEnvTRL, wired into notebooks/train_grpo.ipynb)
  • HuggingFace Space live at https://huggingface.co/spaces/hjerpe/sql_env (F007, pushed 2026-03-29 via uv run openenv push)

Now (in progress)

  • Publish blog post on HuggingFace (2026-04-12)
  • Final review of docs/blog-post-v1.md
  • Verify notebooks run clean on fresh Colab

Later (post-submission)

  • Enable concurrent sessions on the Space (SUPPORTS_CONCURRENT_SESSIONS=True, max_concurrent_envs=64) so external users can retrain against the hosted endpoint without hitting the default 1-session limit
  • Difficulty curriculum (easy -> medium -> hard)
  • Additional Spider databases and question sources
  • Multi-database metamorphic verification in training
  • Community-contributed policies and training recipes