# Architecture

> Last updated: 2026-03-29

System map for SQLEnv, an RL environment where agents learn interactive SQL exploration via the OpenEnv framework.

**Goals:**

- Show how components connect (system map + key flows)
- Make hidden state explicit (what lives where)
- Define shared interfaces (Pydantic models, HTTP API)
- Keep invariants legible (what must stay true)

**Non-goals:**

- Exhaustive API reference
- Training hyperparameter tuning guide

---

## System Map

```text
SQLEnv System
================================================================

RL Training                                SQLEnv Server (Docker)
─────────────                              ──────────────────────
+──────────────+                           +─────────────────────+
│ TRL GRPO     │                           │ server/app.py       │
│ Trainer      │        HTTP (JSON)        │ FastAPI + OpenEnv   │
│              │<========================> │                     │
│ training/    │    SQLAction -> server    +──────────┬──────────+
│ trl_adapter  │    SQLObs    <- server               │
│ .py          │                                      v
+──────────────+                           +─────────────────────+
       │                                   │ SQLEnvironment      │
       OR                                  │ (sql_environment.py)│
       v                                   │                     │
+──────────────+                           │ reset() / step()    │
│ Custom       │                           │ action dispatch     │
│ rollout_func │                           +──┬──────┬──────┬────+
│ (rollout.py) │                              │      │      │
+──────────────+                              v      v      v
                                           +────────────────────────+
Evaluation                                 │ Action Handlers        │
──────────                                 │ DESCRIBE → PRAGMA      │
+──────────────+                           │ SAMPLE   → SELECT N    │
│ evaluate()   │──> env.reset/step         │ QUERY    → SQL exec    │
│ policies     │                           │ ANSWER   → verifier    │
│ .py          │                           +────────┬───────────────+
+──────────────+                                    │
       │                                            v
       v                                   +────────────────────────+
+──────────────+                           │ SQLite (read-only)     │
│ Policies     │                           │ data/databases/        │
│ RandomPolicy │                           │ {db_id}/{db_id}.sqlite │
│ OraclePolicy │                           +────────────────────────+
+──────────────+                                    │
                                            ┌───────┴───────┐
                                            v               v
                                      +───────────+   +───────────+
                                      │ reward.py │   │verifier.py│
                                      │ 3-layer   │   │ type-aware│
                                      │ dense     │   │ comparison│
                                      +───────────+   +───────────+

Data (committed)                           Synthetic (optional)
────────────────                           ────────────────────
data/questions/                            server/synthetic/
  questions_train.json (473 Q)               generate.py
  questions_eval.json  (203 Q)               mutations.py
  db_list.json (10 databases)                validate.py
```

---

## Component Inventory

| Component | Owns | Entrypoint | State / Output |
|-----------|------|------------|----------------|
| **SQLEnvironment** | Episode lifecycle, action dispatch, step budget | `server/sql_environment.py` | `EpisodeContext` (in-memory, per episode) |
| **FastAPI app** | HTTP endpoints, tokenizer factory | `server/app.py` | Stateless (delegates to environment) |
| **SQLEnvClient** | HTTP transport, payload serialization | `client.py` | Stateless (wraps server) |
| **Pydantic models** | Type contracts (action, observation, state) | `models.py` | N/A (data classes) |
| **Reward engine** | 3-layer dense reward computation | `server/reward.py` | Mutates `EpisodeContext` accumulators |
| **Answer verifier** | Type-aware answer comparison | `server/verifier.py` | Stateless (pure function) |
| **GRPO pipeline** | Training orchestration, rollout, reward callables | `training/` (6 modules) | Training artifacts in `outputs/` |
| **TRL adapter** | `environment_factory` for TRL GRPOTrainer | `training/trl_adapter.py` | Per-session environment instances |
| **Evaluation** | Policy protocol, evaluate() runner | `evaluation/policies.py` | `EvaluationResult` metrics |
| **Oracle policy** | Deterministic upper-bound baseline | `evaluation/oracle_policy.py` | Stateless per-step |
| **Synthetic DB gen** | Metamorphic testing via data mutations | `server/synthetic/` | Variant SQLite files |
| **Question dataset** | 676 curated Spider questions across 10 DBs | `data/questions/` | JSON files |

### External Dependencies

| Dependency | Purpose | Required |
|------------|---------|----------|
| SQLite (stdlib) | Database execution | Yes |
| OpenEnv (`openenv-core`) | Environment protocol, `create_app` | Yes |
| TRL (`trl`) | GRPO training | Only for training |
| HuggingFace Transformers | Tokenizer loading | Only for production server |

---

## Key Flows

### Flow: Episode (Reset + Multi-Turn Steps)

```text
Client / Policy                            SQLEnvironment
      │                                          │
      │── reset(seed=42) ──────────────────────> │
      │                                          │── pick question (random or seeded)
      │                                          │── open read-only SQLite connection
      │                                          │── execute gold_sql → store gold_rows
      │                                          │── init EpisodeContext (budget=15)
      │ <── SQLObservation ──────────────────────│
      │     .question="How many students?"       │
      │     .schema_info="Tables: student"       │   (column details hidden)
      │     .budget_remaining=15                 │
      │                                          │
      │── step(DESCRIBE student) ──────────────> │
      │                                          │── PRAGMA table_info(student)
      │                                          │── add to described_tables
      │                                          │── compute_step_reward()
      │ <── SQLObservation ──────────────────────│
      │     .schema_info="student: id INT"       │   (columns now revealed)
      │     .result="5 columns, 20 rows"         │
      │     .reward=0.02                         │
      │     .budget_remaining=14                 │
      │                                          │
      │── step(QUERY "SELECT COUNT(*)...") ────> │
      │                                          │── validate (SELECT-only, single stmt)
      │                                          │── execute with 5s timeout
      │                                          │── compute_step_reward() (L1 + L2)
      │ <── SQLObservation ──────────────────────│
      │     .result="| COUNT(*) |\n| 20 |"       │
      │     .reward=0.035                        │
      │                                          │
      │── step(ANSWER "20") ──────────────────>  │
      │                                          │── verify_answer("20", gold, type)
      │                                          │── terminal reward: +1.0 or 0.0
      │ <── SQLObservation ──────────────────────│
      │     .done=true                           │
      │     .reward=1.0                          │
```

### Flow: 3-Layer Reward Computation

```text
step() called with action
      │
      v
Layer 1: Operational Shaping (every action)
      ├── exec_ok?      → +0.02
      ├── new SQL hash? → +0.01 (per unique query, no cumulative cap)
      ├── repeated SQL? → -0.01
      └── step cost     → -0.005
      │
      v  (only if action_type == QUERY and no error)
Layer 2: Progress Shaping (delta-from-previous, PBRS)
      ├── cardinality score (25%) — |pred_rows - gold_rows| / max
      ├── value overlap (50%)     — Jaccard of cell values
      └── numeric range (25%)     — log-distance proximity
      │
      v
  bin to {0.0, 0.25, 0.5, 0.75, 1.0}
  delta = binned - previous_progress → delta * 0.15
  (positive = improvement, negative = regression)
      │
      v
  Clip per step to [-0.05, +0.15]
  No cumulative tracking
      │
      v  (on ANSWER action)
Layer 3: Terminal Correctness
      └── verify_answer() → +1.0 (correct) or 0.0 (wrong)
```

### Flow: TRL Training Integration

```text
GRPOTrainer
      │
      │── discovers tool methods via docstrings
      │   (describe, sample, query, answer)
      │
      │── per rollout:
      │     SQLEnvTRL() → SQLEnvironment (internal)
      │     .reset()          → observation string
      │     .describe(table)  → schema string
      │     .query(sql)       → result string
      │     .answer(value)    → final string
      │
      │── reward:
      │     sql_env_reward_func() → accumulated .reward
      v
Training loop (GRPO: generate N completions, rank by reward)
```

---

## Shared Data Models

Defined in `models.py`. These cross the HTTP boundary between client and server.
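As a rough sketch of what crosses the wire in one `/step` round trip, the payloads below use the field names of `SQLAction` and `SQLObservation` and the `{observation, reward, done, info}` envelope from the HTTP contract. The exact JSON encoding used by the OpenEnv transport is an assumption here; the values are illustrative.

```python
import json

# Hypothetical /step request: an agent issuing a QUERY action.
step_request = {
    "action_type": "QUERY",
    "argument": "SELECT COUNT(*) FROM student",
    "metadata": {},
}

# Hypothetical /step response wrapping an SQLObservation.
step_response = {
    "observation": {
        "question": "How many students?",
        "schema_info": "student: id INT",
        "result": "| COUNT(*) |\n| 20 |",
        "error": "",
        "step_count": 2,
        "budget_remaining": 13,
        "action_history": ["DESCRIBE student", "QUERY SELECT COUNT(*) ..."],
    },
    "reward": 0.035,
    "done": False,
    "info": {},
}

# Both payloads are plain JSON and survive a round trip unchanged.
assert json.loads(json.dumps(step_request)) == step_request
assert json.loads(json.dumps(step_response)) == step_response
```

Because every field is a JSON-native type (strings, numbers, booleans, lists, objects), the Pydantic models serialize losslessly across the client/server boundary.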
### SQLAction (agent -> server)

```python
class SQLAction(Action):
    action_type: str  # DESCRIBE | SAMPLE | QUERY | ANSWER
    argument: str     # table name, SQL string, or answer value
```

### SQLObservation (server -> agent)

```python
class SQLObservation(Observation):
    question: str              # NL question to answer
    schema_info: str           # known schema (incrementally revealed)
    result: str                # last action result (truncated)
    error: str                 # error message if action failed
    step_count: int            # current step number
    budget_remaining: int      # steps left
    action_history: list[str]  # summary of prior actions
    # Inherited: done (bool), reward (float | None)
```

### SQLState (metadata endpoint)

```python
class SQLState(State):
    history_messages: list[Message]
    current_action_type: str
```

### Server-Only Types (never sent to agent)

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class QuestionRecord:
    question_id: str
    question_text: str
    database_name: str
    gold_sql: str
    gold_answer: str
    answer_type: str  # integer | float | string | list
    difficulty: str   # easy | medium | hard
    tables_involved: list[str]


@dataclass
class EpisodeContext:
    # Non-default fields must precede defaulted ones in a dataclass.
    episode_id: str
    db_connection: sqlite3.Connection
    question_record: QuestionRecord
    gold_answer: str | None
    gold_rows: list[tuple]
    step_count: int = 0
    budget: int = 15
    done: bool = False
    described_tables: set[str] = field(default_factory=set)
    action_log: list[str] = field(default_factory=list)
    # Reward accumulators
    query_hashes: set[str] = field(default_factory=set)
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0
    cumulative_new_info_reward: float = 0.0
```

**POMDP design:** The agent sees `SQLObservation`. The server holds `EpisodeContext`. The agent never sees gold answers, progress scores, or the full database. This separation forces exploration.

---

## API Contracts

### HTTP (OpenEnv Protocol)

The server exposes HTTP endpoints via `openenv.core.env_server.create_app()`.
| Operation | Method | Payload | Response |
|-----------|--------|---------|----------|
| Reset | `POST /reset` | `{seed: int}` (optional) | `SQLObservation` (JSON) |
| Step | `POST /step` | `{action_type, argument, metadata}` | `{observation, reward, done, info}` |
| State | `GET /state` | — | `SQLState` (JSON) |

### Evaluation API

```python
# Policy protocol
class Policy(Protocol):
    def select_action(self, observation: SQLObservation) -> SQLAction: ...

# Built-in policies
RandomPolicy()           # random baseline
OraclePolicy(questions)  # gold-answer upper bound

# Runner
evaluate(env, policy, n_episodes, seed) -> EvaluationResult
# .success_rate, .avg_reward, .avg_steps, .episodes[]
```

### TRL Adapter API

```python
SQLEnvTRL.configure(questions_path, db_dir, step_budget)  # class method

# Tool methods (auto-discovered by TRL):
SQLEnvTRL.describe(table_name: str) -> str
SQLEnvTRL.sample(table_name: str) -> str
SQLEnvTRL.query(sql: str) -> str
SQLEnvTRL.answer(value: str) -> str
```

---

## Cross-Cutting Concerns

### SQL Safety

All database access enforces:

- **Read-only** SQLite connections (`file:...?mode=ro`)
- **SELECT-only** — rejects INSERT, UPDATE, DELETE, ALTER, DROP
- **Single statement** — rejects `; ...` (no stacked queries)
- **5-second timeout** via SQLite progress handler
- **20-row truncation** on all result sets

### POMDP Structure

The partial observability is deliberate and load-bearing:

- Agent sees table names at reset but **not column details** (must DESCRIBE)
- Query results are **truncated** (at most 20 rows)
- Agent never sees `gold_answer`, `best_progress`, or `gold_rows`
- Step budget (default 15) forces strategic allocation of exploration

### Import Compatibility

Dual import paths throughout for local vs Docker execution:

```python
try:
    from sql_env.models import SQLAction  # local / pip install
except ImportError:
    from models import SQLAction          # Docker (PYTHONPATH=/app/env)
```

### Configuration

| Variable | Required | Default | Purpose |
|----------|----------|---------|---------|
| `QUESTIONS_PATH` | No | `data/questions/student_assessment.json` | Questions JSON |
| `DB_DIR` | No | `data/databases/` | SQLite database directory |
| `TOKENIZER_NAME` | No | `mistralai/Mistral-7B-Instruct-v0.1` | HuggingFace tokenizer |
| `PORT` | No | `8000` | Server port |

---

## Data, State, and Storage

### Committed Data

| Path | Contents |
|------|----------|
| `data/questions/questions_train.json` | 473 training questions across 10 DBs |
| `data/questions/questions_eval.json` | 203 evaluation questions across 10 DBs |
| `data/questions/db_list.json` | 10 Spider database IDs |
| `data/databases/models.py` | Legacy SQLAlchemy ORM models |

### Downloaded Data (gitignored)

Spider SQLite databases in `data/databases/{db_id}/{db_id}.sqlite`. Downloaded via `scripts/download_spider_databases.py`. The 10 databases: student_assessment, concert_singer, world_1, car_1, employee_hire_evaluation, pets_1, cre_Doc_Template_Mgt, dog_kennels, flight_2, poker_player.

### Runtime State (in-memory, per episode)

`EpisodeContext` holds all episode state: DB connection, gold data, reward accumulators, action history. It is created on `reset()` and discarded when the episode ends. Nothing persists between episodes.
---

## Snapshot (auto-managed)

- Repo signals: Python (`pyproject.toml`)
- Roots: `tests/`
- Entrypoint candidates: (none detected)

```text
tests/
  e2e/
    test_training_e2e.py
  integration/
    test_training_pipeline.py
  unit/
    test_error_handling.py
    test_grpo_config.py
    test_oracle_policy.py
    test_prompts.py
    test_reward.py
    test_rewards.py
    test_rollout.py
    test_sft_terminal_message.py
    test_trl_adapter.py
  test_evaluation.py
  test_smoke.py
  test_synthetic.py
  test_trl_adapter.py
  test_verifier.py
  test_verifier_integration.py
```

---

## Infrastructure

### Development

**Prerequisites:** Python 3.11-3.12, `uv`, Docker (for deployment)

```bash
git clone && cd sql-env
uv sync
uv run python scripts/download_spider_databases.py
uv run pytest tests/ -v
```

### Deployment

**Target:** HuggingFace Spaces (Docker, free tier)

```bash
uv run openenv build  # build Docker image
uv run openenv push   # push to HF Spaces
```

The Dockerfile uses a multi-stage build based on `openenv-base`, runs as non-root `appuser`, bundles the Spider databases, and exposes port 8000.

---

## Invariants

- `SQLState.history_messages` grows monotonically across turns (history never shrinks mid-episode)
- `EpisodeContext` is server-only — leaking gold data to the agent breaks the POMDP
- Per-step rewards are clipped to `[-0.05, +0.15]` — the terminal reward (+1.0) always dominates exploration (~0.3 max)
- `tests/` must pass without a GPU, without network access, and without downloaded databases (mocks/fixtures)
- SQL execution never mutates the database (read-only mode enforced at the connection level)

---

## Glossary

| Term | Definition |
|------|------------|
| Episode | One question-answering session: reset -> N steps -> terminal |
| Action type | One of: DESCRIBE, SAMPLE, QUERY, ANSWER |
| POMDP | Partially observable MDP. Agent acts under uncertainty |
| Spider | Academic text-to-SQL benchmark dataset (10 DBs used) |
| OpenEnv | Meta's RL environment framework (Environment, EnvClient) |
| Green Agent | OpenEnv's evaluation wrapper pattern |
| Oracle policy | Baseline that uses gold SQL/answer for reward-ceiling validation |
| TRL | HuggingFace Transformer Reinforcement Learning library |
| GRPO | Group Relative Policy Optimization (RL algorithm used for training) |
| Dense reward | Per-step reward signal (vs sparse terminal-only reward) |

---

## References

- OpenEnv framework: https://github.com/meta-pytorch/OpenEnv
- Spider dataset: https://huggingface.co/datasets/xlangai/spider
- TRL OpenEnv docs: https://huggingface.co/docs/trl/openenv