
Research Summary

Project: sql-env
Change: F001 - Core Environment Loop (step/reset lifecycle with structured actions, SQLite execution, sandboxing, question loading, step budget)
Date: 2026-03-24
Status: Draft


1. Change Overview

What We're Changing

Complete the step/reset lifecycle so the environment actually executes SQL. Currently step() delegates to Ollama for action interpretation and SQL generation -- the environment never touches SQLite. This feature removes the Ollama dependency from step(), accepts structured actions (DESCRIBE table_name, SAMPLE table_name, QUERY sql_string, ANSWER value), wires up SQLite execution with sandboxing (read-only, 5s timeout, SELECT-only), loads questions from JSON on reset(), enforces a 15-step budget, and handles episode termination.
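As an illustration, the four structured actions might serialize to payloads like the following. This is a hypothetical shape: whether the argument lives in a dedicated argument field or reuses action_description is still an open question (see Section 6).

```python
# Hypothetical wire payloads for the four structured actions. The
# "argument" field name follows the v1 spec sketch and is not final.
actions = [
    {"action_type": "describe", "argument": "employees"},
    {"action_type": "sample",   "argument": "employees"},
    {"action_type": "query",    "argument": "SELECT name FROM employees LIMIT 5"},
    {"action_type": "answer",   "argument": "42"},
]
```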

Why We're Changing It

The environment is architecturally broken for RL use. In an RL environment, the AGENT generates actions and the ENVIRONMENT executes them deterministically. Currently the environment calls Ollama inside step() to (a) select which table to DESCRIBE and (b) generate SQL for QUERY actions. This makes the environment non-deterministic and couples it to an external LLM service. The v1 spec defines structured actions where the agent provides table names and SQL directly.

Success Criteria

  • Agent sends DESCRIBE employees and immediately sees column names and types
  • Queries execute in <100ms with clean truncated output (max 20 rows)
  • Bad SQL returns a clear error message the agent can learn from
  • Episode ends cleanly when budget exhausted or ANSWER submitted
  • No Ollama dependency in the environment's step/reset path

2. System Context

Current Behavior

SQLEnvironment inherits from OpenEnv Environment[SQLAction, SQLObservation, SQLState]. On reset(), it clears history_messages/history_tokens and re-adds the system prompt. On step(), it dispatches on action.action_type:

  • describe -> calls _call_ollama_to_select_table() then _get_table_schema()
  • query -> calls _call_ollama_for_sql() (generates SQL but never executes it)
  • sample -> calls _call_ollama_to_select_table() then _generate_sample_query() (generates SQL string but never executes it)

No ANSWER action exists. No questions are loaded. No SQLite database connection exists. No step budget is tracked. The SQLObservation currently only carries messages and tokens (the richer fields like question, schema_info, result, error, step_count, budget_remaining, action_history are commented out in models.py).

Architecture Context

Agent (external) --WebSocket/HTTP--> FastAPI (server/app.py)
                                       |
                                       v
                              SQLEnvironment (server/sql_environment.py)
                                       |
                                 [MISSING: SQLite connection]
                                 [MISSING: Question loading]
                                 [MISSING: Episode context]
                                       |
                              models.py (SQLAction, SQLObservation, SQLState)
                              data/questions/student_assessment.json (53 Q&A pairs)
                              data/databases/models.py (9 ORM tables, no .sqlite file)

Entry Points

| Entry Point | Trigger | Current Flow |
| --- | --- | --- |
| POST /reset (via OpenEnv create_app) | Client calls reset | SQLEnvironment.reset() -> clears history, returns observation with system prompt |
| POST /step (via OpenEnv create_app) | Client sends action | SQLEnvironment.step(action) -> dispatches on action_type, calls Ollama, appends messages |
| GET /state | Client queries state | Returns SQLState (history_messages, history_tokens, current_action_type) |
| WebSocket /ws | Persistent connection | Same reset/step but over WS |

Data Flow

| Data | Source | Shape/Type | Destination |
| --- | --- | --- | --- |
| SQLAction | Agent via HTTP/WS | {action_type: str, action_description: str, tokens: Tensor} | SQLEnvironment.step() |
| SQLObservation | Environment | {messages: list[Message], tokens: Tensor, done: bool, reward: float} | Agent via HTTP/WS |
| Questions JSON | data/questions/student_assessment.json | [{db_id, query, question, query_toks, query_toks_no_value, ...}] | Loaded at reset() to pick an episode question |
| SQLite database | data/databases/ (NOT YET PRESENT) | .sqlite file | Read-only connection per episode |
| ORM models | data/databases/models.py | 9 SQLAlchemy classes | Used by _get_table_schema() for column introspection |

3. Dependencies

Code We Depend On

| Dependency | What We Use | Risk if Changed |
| --- | --- | --- |
| openenv.core.env_server.interfaces.Environment | Base class: `reset(seed, episode_id, **kwargs) -> ObsT`, `step(action, timeout_s, **kwargs) -> ObsT`, `state` property | Signature changes break our overrides |
| openenv.core.env_server.types.Action, Observation, State | Pydantic base models for SQLAction, SQLObservation, SQLState | Field additions could conflict |
| openenv.core.env_server.create_app | FastAPI app factory that wires endpoints to our environment | N/A (stable) |
| openenv.core.env_server.interfaces.ModelTokenizer | Protocol for tokenizer (apply_chat_template, decode) | Only used for token history -- not needed for F001 core logic |
| SQLAlchemy ORM models (data/databases/models.py) | 9 model classes for table introspection via `__table__.columns` | Column/table name drift breaks schema descriptions |
| sqlite3 (stdlib) | Will be used for query execution | Stable |
| torch | Tensor operations for token history | Current coupling is heavy but out of F001 scope to change |

Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
| --- | --- | --- |
| server/app.py | Creates SQLEnvironment via factory, passes to create_app() | Constructor signature change if we add params (e.g., questions_path, db_path) |
| client.py (SQLEnvClient) | _step_payload() serializes SQLAction, _parse_result() deserializes SQLObservation | If SQLObservation fields change (uncomment rich fields), client must be updated |
| tests/test_smoke.py | Tests reset(), step(), message_to_action(), schema introspection | Tests will need updating for new behavior (step now executes SQL, reset now loads question) |

External Systems

| System | Integration Point | Considerations |
| --- | --- | --- |
| SQLite database files | data/databases/*.sqlite | Files do NOT exist yet. ORM models define the schema but no .sqlite file is present. Need to either (a) generate from ORM models via SQLAlchemy create_all() plus seed data, or (b) download Spider database files |
| Spider dataset (HuggingFace) | scripts/download_spider_data.py | Downloads question JSON only, NOT the actual SQLite database files. student_assessment.json references db_id: "student_assessment" but no corresponding .sqlite exists |
| Ollama (BEING REMOVED) | _call_ollama_to_select_table(), _call_ollama_for_sql() | These will be deleted. No fallback needed -- the agent provides structured actions directly |
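Option (a), generating the missing database from the ORM models, could look roughly like the sketch below. It assumes SQLAlchemy 1.4+; Student is a stand-in model, not one of the real nine classes in data/databases/models.py, and the output filename is illustrative.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Student(Base):
    """Stand-in for the real ORM models in data/databases/models.py."""
    __tablename__ = "students"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

# Materialize a .sqlite file with empty tables from the ORM schema...
engine = create_engine("sqlite:///student_assessment.sqlite")
Base.metadata.create_all(engine)

# ...then seed minimal data so SAMPLE/QUERY return meaningful rows.
with Session(engine) as session:
    session.add(Student(id=1, name="Ada"))
    session.commit()
```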

4. Risks & Edge Cases

Identified Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| No .sqlite database file exists | Certain | Environment cannot execute SQL at all | Must create/download database before this feature works. SQLAlchemy Base.metadata.create_all(engine) can create empty tables, but data seeding is needed for meaningful SAMPLE/QUERY results |
| Question JSON has no gold_answer field | High | Cannot implement ANSWER verification | Spider format has query (gold SQL) but not a pre-computed gold answer. Must either run gold SQL at reset() to compute the answer, or defer ANSWER verification to Phase 2 (verifier.py) |
| SQLObservation field changes break client | Medium | Client deserialization fails | Update SQLEnvClient._parse_result() alongside observation changes |
| Existing tests assume Ollama-based step behavior | Medium | Tests break | Rewrite tests to use structured actions |
| SQL injection via QUERY action | Medium | Agent could run destructive SQL | Enforce SELECT-only parsing + read-only SQLite connection + 5s timeout |
| message_to_action() pipeline conflicts with structured actions | Medium | Two action-creation paths (NL keyword detection vs. structured) | The message_to_action() + _detect_action_type() pipeline is designed for the Ollama-based flow. With structured actions, the agent sends action_type directly. Need to decide whether to keep or remove message_to_action() |

Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
| --- | --- | --- |
| DESCRIBE with invalid table name | Ollama guesses a table | Return error: "Table 'xyz' not found. Available tables: ..." |
| QUERY with non-SELECT SQL (INSERT, DROP, etc.) | Never executed | Reject with clear error before execution |
| QUERY that times out (>5s) | Never executed | Kill query, return timeout error |
| QUERY returning >20 rows | Never executed | Truncate to 20 rows, indicate truncation |
| Budget exhausted (15 steps) without ANSWER | No budget tracking | Set done=True, reward=0, return termination observation |
| ANSWER action | Not implemented | Compare answer to gold, set done=True, compute reward |
| Empty/null action_description | Passed to Ollama | Validate and return error |
| reset() called mid-episode | Clears history only | Close SQLite connection, pick new question, open new connection |
| DESCRIBE "all" | Not handled distinctly | Return all table names (per v1 spec field description) |

Invariants to Preserve

  • OpenEnv Environment interface contract: reset() returns ObsT, step() returns ObsT, state property returns StateT
  • Observation.done=True only on episode termination (ANSWER submitted or budget exhausted)
  • SQLite connection is always read-only (no writes possible)
  • Step budget decrements only on non-ANSWER actions
  • Agent never sees gold_sql or gold_answer in observations

4b. Code Shape & Design Target

Existing Vocabulary

| Concept | Existing Name | Location |
| --- | --- | --- |
| Agent action | SQLAction (action_type, action_description, tokens) | models.py |
| Environment response | SQLObservation (messages, tokens, done, reward) | models.py |
| Episode metadata | SQLState (history_messages, history_tokens, current_action_type) | models.py |
| Table introspection | _get_table_schema(table_name) | server/sql_environment.py |
| Type conversion | _sqlalchemy_type_to_natural_language(col_type) | server/sql_environment.py |
| Sample query generation | _generate_sample_query(table_name, limit) | server/sql_environment.py |
| Core environment | SQLEnvironment(Environment[SQLAction, SQLObservation, SQLState]) | server/sql_environment.py |
| Per-episode state (conceptual) | EpisodeContext (commented design outline) | models.py lines 130-247 |
| Question record (conceptual) | QuestionRecord | models.py lines 224-236 |
| Answer verification (stub) | server/verifier.py | Placeholder only |
| Reward computation (stub) | server/reward.py | Placeholder only |

Language/Framework Idioms

  • Pydantic models for all wire types (Action, Observation, State) -- follows OpenEnv pattern
  • SQLAlchemy ORM for schema definition, but NOT for query execution (agents run raw SQL)
  • FastAPI via OpenEnv's create_app() factory pattern
  • TypedDict for Message ({role: str, content: str})
  • Private methods prefixed with _ for internal helpers
  • Logging via logging.getLogger(__name__)
  • Type annotations throughout, including generics on Environment base class
  • torch.Tensor for token storage (inherited from OpenEnv pattern)
  • dataclass-style Pydantic with Field() for all model fields

Target Shape

| Component | Purpose | Why This Boundary |
| --- | --- | --- |
| EpisodeContext (dataclass or Pydantic) | Per-episode server state: db_connection, question, step_count, budget, described_tables, action_log | Already designed in models.py comments; isolates episode state from environment singleton |
| _execute_sql(sql, timeout) | Sandboxed SQL execution: SELECT-only check, read-only connection, timeout, truncation | Single responsibility; reused by both QUERY and internal gold-answer computation |
| _handle_describe(table_name) | Return schema for a specific table | Already exists as _get_table_schema(); just needs to use the agent-provided table name directly instead of Ollama |
| _handle_sample(table_name) | Execute SELECT * FROM table LIMIT N via _execute_sql() | Already exists as _generate_sample_query(); needs to actually execute the SQL |
| _handle_query(sql_string) | Validate and execute agent-provided SQL | New; wraps _execute_sql() with SELECT-only validation |
| _handle_answer(value) | Compare to gold answer, set done=True, compute terminal reward | New; minimal for MVP (delegate to verifier.py later) |
| _load_questions(path) | Load and parse question JSON | Simple loader; called once at init or lazily |
| _open_db(db_path) | Open read-only SQLite connection with timeout | Called at reset(); isolated for testability |
| _build_observation() | Construct SQLObservation from episode context | Replaces the current _create_observation(), which only handles messages/tokens |

Abstraction Level

  • Current level: Mostly flat -- single class with private helper methods. No service layer, no repository pattern. The conceptual EpisodeContext is outlined but not implemented.
  • Recommendation: Stay flat. Add EpisodeContext as a dataclass to hold per-episode state (replacing the current self._state which mixes episode data with token history). Keep all action handlers as private methods on SQLEnvironment. Do not introduce a separate service class or handler classes.
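A minimal EpisodeContext along those lines might look like this. Field names follow the commented outline in models.py but are assumptions, not the final API:

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class EpisodeContext:
    """Per-episode server state; replaces the episode half of self._state."""
    question: dict                     # the Spider record picked at reset()
    db_connection: sqlite3.Connection  # read-only, opened at reset()
    budget: int = 15
    step_count: int = 0
    described_tables: set = field(default_factory=set)
    action_log: list = field(default_factory=list)

    @property
    def budget_remaining(self) -> int:
        return self.budget - self.step_count
```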

Anti-Patterns to Avoid

  • Do not create separate handler classes (e.g., DescribeHandler, QueryHandler) -- this codebase uses private methods on the environment class
  • Do not over-abstract the SQL execution -- a single _execute_sql() method is sufficient; no need for a query builder or execution strategy pattern
  • Do not keep the message_to_action() + _detect_action_type() pipeline for the new structured flow -- these are artifacts of the Ollama-based NL interpretation. The agent now sends structured actions directly. However, if message_to_action() is part of the OpenEnv contract, it may need to be preserved as a thin adapter
  • Do not add reward computation in F001 -- server/reward.py is explicitly a Phase 3 stub. For F001, use simple terminal reward (1.0 for correct ANSWER, 0.0 otherwise)
  • Avoid a single 200-line step() method -- dispatch to _handle_describe(), _handle_sample(), _handle_query(), _handle_answer() and keep each under 30 lines
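The recommended dispatch shape, sketched with placeholder handler bodies. Everything here is illustrative: the real handlers would build SQLObservation instances rather than plain dicts, and the class name is hypothetical.

```python
class SQLEnvironmentSketch:
    """Skeleton of the step() dispatch; handler bodies are placeholders."""

    def __init__(self, budget: int = 15):
        self.budget = budget
        self.step_count = 0

    def step(self, action_type: str, argument: str) -> dict:
        handlers = {
            "describe": self._handle_describe,
            "sample": self._handle_sample,
            "query": self._handle_query,
            "answer": self._handle_answer,
        }
        handler = handlers.get(action_type)
        if handler is None:
            return {"error": f"Unknown action_type {action_type!r}", "done": False}
        if action_type != "answer":
            self.step_count += 1  # budget decrements only on non-ANSWER actions
        obs = handler(argument)
        if self.step_count >= self.budget and not obs.get("done"):
            obs = {"error": "Step budget exhausted", "done": True, "reward": 0.0}
        return obs

    def _handle_describe(self, table: str) -> dict:
        return {"result": f"schema of {table}", "done": False}

    def _handle_sample(self, table: str) -> dict:
        return {"result": f"sample rows from {table}", "done": False}

    def _handle_query(self, sql: str) -> dict:
        return {"result": f"executed: {sql}", "done": False}

    def _handle_answer(self, value: str) -> dict:
        return {"result": value, "done": True, "reward": 0.0}
```

Each handler stays well under 30 lines, and step() itself does nothing but validate, decrement the budget, and dispatch.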

5. Constraints

Technical Constraints

| Constraint | Requirement | Notes |
| --- | --- | --- |
| Query execution latency | < 100ms | SQLite on local disk with small Spider databases; should be trivial |
| Query timeout | 5 seconds max | Use sqlite3.Connection.set_progress_handler() or execute in a thread with timeout |
| Read-only access | No writes to database | Open SQLite with ?mode=ro URI or use PRAGMA query_only = ON |
| SELECT-only queries | Block INSERT/UPDATE/DELETE/DROP/CREATE/ALTER | Parse SQL prefix before execution |
| Output truncation | Max 20 rows | Truncate result set, add "... (N more rows)" indicator |
| Step budget | 15 steps per episode (configurable) | Decrement on non-ANSWER actions |
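The read-only URI, progress-handler timeout, and 20-row truncation constraints could combine into a single _execute_sql()-style helper. A sketch only; the result-dict shape and names are assumptions:

```python
import sqlite3
import time

def execute_sql(db_path: str, sql: str, timeout_s: float = 5.0,
                max_rows: int = 20) -> dict:
    """Sandboxed executor sketch: read-only URI connection, progress-handler
    timeout, and row truncation. The result-dict shape is an assumption."""
    try:
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    except sqlite3.OperationalError as exc:
        return {"columns": [], "rows": [], "truncated": False, "error": str(exc)}
    deadline = time.monotonic() + timeout_s
    # The handler fires every 1000 SQLite VM ops; a truthy return aborts.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        cursor = conn.execute(sql)
        rows = cursor.fetchmany(max_rows + 1)  # fetch one extra to detect overflow
        return {
            "columns": [col[0] for col in cursor.description or []],
            "rows": rows[:max_rows],
            "truncated": len(rows) > max_rows,
            "error": None,
        }
    except sqlite3.OperationalError as exc:
        return {"columns": [], "rows": [], "truncated": False, "error": str(exc)}
    finally:
        conn.close()
```

Note that mode=ro makes writes fail at execute time with an OperationalError, so even SQL that slips past the SELECT-only check cannot mutate the database.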

Pattern Constraints

  • Must implement reset(seed, episode_id, **kwargs) -> SQLObservation matching OpenEnv base signature
  • Must implement step(action, timeout_s, **kwargs) -> SQLObservation matching OpenEnv base signature
  • Must maintain state property returning SQLState
  • Pydantic models must remain serializable over HTTP/WebSocket (no raw sqlite3 objects in observations)
  • Constructor must remain compatible with create_sql_environment() factory in app.py

Testing Constraints

| Test Suite | Coverage Area | Notes |
| --- | --- | --- |
| tests/test_smoke.py | Models, environment reset/step, action detection, message_to_action, client serialization, schema introspection | 6 test classes, ~20 tests. Several will break: step tests assume Ollama-based behavior; reset tests assume no question loading. TestActionDetection and TestMessageToAction test NL keyword detection, which may be deprecated |

6. Open Questions

| Question | Why It Matters | Who Can Answer |
| --- | --- | --- |
| Where do the SQLite database files come from? | No .sqlite files exist. ORM models define the schema, but Spider databases are separate artifacts; download_spider_data.py only downloads question JSON, not databases | Developer decision: generate from ORM + seed, or download Spider DBs |
| Should action_description carry the structured argument, or should we add a dedicated argument field to SQLAction? | The v1 spec defines argument: str as a separate field. Current code uses action_description for NL text. Reusing action_description avoids a model change but is semantically misleading | Developer (API design decision) |
| What happens to message_to_action() and _detect_action_type()? | These convert NL messages to actions using keyword matching. With structured actions, the agent sends action_type directly -- but message_to_action might be part of the OpenEnv contract | Developer + OpenEnv docs |
| Should SQLObservation fields be uncommented (question, schema_info, result, error, etc.), or should we continue using messages-only? | The v1 spec and the commented-out fields describe a rich observation; the current implementation uses only messages + tokens. Rich fields make the API cleaner for RL agents | Developer (observation design decision) |

7. Context Sources

| Source | Type | Notes |
| --- | --- | --- |
| server/sql_environment.py | Code | Main environment: 546 lines, Ollama-dependent step(), no SQL execution |
| models.py | Code | Wire types + conceptual EpisodeContext design (commented) |
| server/app.py | Code | FastAPI factory, tokenizer setup |
| data/databases/models.py | Code | 9 SQLAlchemy ORM tables (student_assessment schema) |
| data/questions/student_assessment.json | Data | 53 Spider questions with gold SQL, db_id, tokenized queries |
| docs_draft/SQLEnv_Concept_v1.md | Spec | V1 design: action space, episode lifecycle, reward architecture, anti-gaming |
| server/verifier.py | Code | Stub -- placeholder for Phase 2 answer verification |
| server/reward.py | Code | Stub -- placeholder for Phase 3 reward computation |
| scripts/download_spider_data.py | Code | Downloads question JSON from HuggingFace Spider dataset (not databases) |
| tests/test_smoke.py | Code | 6 test classes, ~20 tests covering models, env, action detection, client |
| server/test_sql_env.py | Code | MockTokenizer for testing without transformers |
| client.py | Code | SQLEnvClient with _step_payload() and _parse_result() |
| OpenEnv interfaces.py | Code (vendored) | Environment base class: reset(seed, episode_id), step(action, timeout_s), state property |
| OpenEnv types.py | Code (vendored) | Action, Observation, State Pydantic base models |

Human Validation Checkpoint

Before proceeding to planning, please confirm:

  • System context is accurate
  • Dependencies are complete
  • Risks are identified
  • Constraints are correct
  • Open questions can be resolved

Questions for reviewer:

  1. Is anything incorrect or missing?
  2. Are there risks I haven't identified?
  3. Should we proceed to planning?

Validated by: [NAME] on [DATE]