Research Summary
Project: sql-env
Change: F001 - Core Environment Loop (step/reset lifecycle with structured actions, SQLite execution, sandboxing, question loading, step budget)
Date: 2026-03-24
Status: Draft
1. Change Overview
What We're Changing
Complete the step/reset lifecycle so the environment actually executes SQL. Currently step() delegates to Ollama for action interpretation and SQL generation -- the environment never touches SQLite. This feature removes the Ollama dependency from step(), accepts structured actions (DESCRIBE table_name, SAMPLE table_name, QUERY sql_string, ANSWER value), wires up SQLite execution with sandboxing (read-only, 5s timeout, SELECT-only), loads questions from JSON on reset(), enforces a 15-step budget, and handles episode termination.
Why We're Changing It
The environment is architecturally broken for RL use. In an RL environment, the AGENT generates actions and the ENVIRONMENT executes them deterministically. Currently the environment calls Ollama inside step() to (a) select which table to DESCRIBE and (b) generate SQL for QUERY actions. This makes the environment non-deterministic and couples it to an external LLM service. The v1 spec defines structured actions where the agent provides table names and SQL directly.
Success Criteria
- Agent sends
DESCRIBE employees and immediately sees column names and types
- Queries execute in <100ms with clean truncated output (max 20 rows)
- Bad SQL returns a clear error message the agent can learn from
- Episode ends cleanly when budget exhausted or ANSWER submitted
- No Ollama dependency in the environment's step/reset path
2. System Context
Current Behavior
SQLEnvironment inherits from OpenEnv Environment[SQLAction, SQLObservation, SQLState]. On reset(), it clears history_messages/history_tokens and re-adds the system prompt. On step(), it dispatches on action.action_type:
describe -> calls _call_ollama_to_select_table() then _get_table_schema()
query -> calls _call_ollama_for_sql() (generates SQL but never executes it)
sample -> calls _call_ollama_to_select_table() then _generate_sample_query() (generates SQL string but never executes it)
No ANSWER action exists. No questions are loaded. No SQLite database connection exists. No step budget is tracked. The SQLObservation currently only carries messages and tokens (the richer fields like question, schema_info, result, error, step_count, budget_remaining, action_history are commented out in models.py).
Architecture Context
Agent (external) --WebSocket/HTTP--> FastAPI (server/app.py)
|
v
SQLEnvironment (server/sql_environment.py)
|
[MISSING: SQLite connection]
[MISSING: Question loading]
[MISSING: Episode context]
|
models.py (SQLAction, SQLObservation, SQLState)
data/questions/student_assessment.json (53 Q&A pairs)
data/databases/models.py (9 ORM tables, no .sqlite file)
Entry Points
| Entry Point |
Trigger |
Current Flow |
POST /reset (via OpenEnv create_app) |
Client calls reset |
SQLEnvironment.reset() -> clears history, returns observation with system prompt |
POST /step (via OpenEnv create_app) |
Client sends action |
SQLEnvironment.step(action) -> dispatches on action_type, calls Ollama, appends messages |
GET /state |
Client queries state |
Returns SQLState (history_messages, history_tokens, current_action_type) |
WebSocket /ws |
Persistent connection |
Same reset/step but over WS |
Data Flow
| Data |
Source |
Shape/Type |
Destination |
| SQLAction |
Agent via HTTP/WS |
{action_type: str, action_description: str, tokens: Tensor} |
SQLEnvironment.step() |
| SQLObservation |
Environment |
{messages: list[Message], tokens: Tensor, done: bool, reward: float} |
Agent via HTTP/WS |
| Questions JSON |
data/questions/student_assessment.json |
[{db_id, query, question, query_toks, query_toks_no_value, ...}] |
Loaded at reset() to pick an episode question |
| SQLite database |
data/databases/ (NOT YET PRESENT) |
.sqlite file |
Read-only connection per episode |
| ORM models |
data/databases/models.py |
9 SQLAlchemy classes |
Used by _get_table_schema() for column introspection |
3. Dependencies
Code We Depend On
| Dependency |
What We Use |
Risk if Changed |
openenv.core.env_server.interfaces.Environment |
Base class: reset(seed, episode_id, **kwargs) -> ObsT, step(action, timeout_s, **kwargs) -> ObsT, state property |
Signature changes break our overrides |
openenv.core.env_server.types.Action, Observation, State |
Pydantic base models for SQLAction, SQLObservation, SQLState |
Field additions could conflict |
openenv.core.env_server.create_app |
FastAPI app factory that wires endpoints to our environment |
N/A (stable) |
openenv.core.env_server.interfaces.ModelTokenizer |
Protocol for tokenizer (apply_chat_template, decode) |
Only used for token history -- not needed for F001 core logic |
SQLAlchemy ORM models (data/databases/models.py) |
9 model classes for table introspection via __table__.columns |
Column/table name drift breaks schema descriptions |
sqlite3 (stdlib) |
Will be used for query execution |
Stable |
torch |
Tensor operations for token history |
Current coupling is heavy but out of F001 scope to change |
Code That Depends On Us
| Dependent |
How They Use Us |
Impact of Our Change |
server/app.py |
Creates SQLEnvironment via factory, passes to create_app() |
Constructor signature change if we add params (e.g., questions_path, db_path) |
client.py (SQLEnvClient) |
_step_payload() serializes SQLAction, _parse_result() deserializes SQLObservation |
If SQLObservation fields change (uncomment rich fields), client must be updated |
tests/test_smoke.py |
Tests reset(), step(), message_to_action(), schema introspection |
Tests will need updating for new behavior (step now executes SQL, reset now loads question) |
External Systems
| System |
Integration Point |
Considerations |
| SQLite database files |
data/databases/*.sqlite |
Files do NOT exist yet. ORM models define schema but no .sqlite file is present. Need to either: (a) generate from ORM models via SQLAlchemy create_all() + seed data, or (b) download Spider database files |
| Spider dataset (HuggingFace) |
scripts/download_spider_data.py |
Downloads question JSON only. Does NOT download the actual SQLite database files. The student_assessment.json references db_id: "student_assessment" but no corresponding .sqlite exists |
| Ollama (BEING REMOVED) |
_call_ollama_to_select_table(), _call_ollama_for_sql() |
These will be deleted. No fallback needed -- the agent provides structured actions directly |
4. Risks & Edge Cases
Identified Risks
| Risk |
Likelihood |
Impact |
Mitigation |
| No .sqlite database file exists |
Certain |
Environment cannot execute SQL at all |
Must create/download database before this feature works. SQLAlchemy Base.metadata.create_all(engine) can create empty tables, but data seeding is needed for meaningful SAMPLE/QUERY results |
Question JSON has no gold_answer field |
High |
Cannot implement ANSWER verification |
Spider format has query (gold SQL) but not a pre-computed gold answer. Must either run gold SQL at reset() to compute answer, or defer ANSWER verification to Phase 2 (verifier.py) |
| SQLObservation field changes break client |
Medium |
Client deserialization fails |
Update SQLEnvClient._parse_result() alongside observation changes |
| Existing tests assume Ollama-based step behavior |
Medium |
Tests break |
Rewrite tests to use structured actions |
| SQL injection via QUERY action |
Medium |
Agent could run destructive SQL |
Enforce SELECT-only parsing + read-only SQLite connection + 5s timeout |
message_to_action() pipeline conflicts with structured actions |
Medium |
Two action creation paths (NL keyword detection vs. structured) |
The message_to_action() + _detect_action_type() pipeline is designed for the Ollama-based flow. With structured actions, the agent sends action_type directly. Need to decide whether to keep/remove message_to_action() |
Edge Cases to Handle
| Edge Case |
Current Behavior |
Required Behavior |
| DESCRIBE with invalid table name |
Ollama guesses a table |
Return error: "Table 'xyz' not found. Available tables: ..." |
| QUERY with non-SELECT SQL (INSERT, DROP, etc.) |
Never executed |
Reject with clear error before execution |
| QUERY that times out (>5s) |
Never executed |
Kill query, return timeout error |
| QUERY returning >20 rows |
Never executed |
Truncate to 20 rows, indicate truncation |
| Budget exhausted (15 steps) without ANSWER |
No budget tracking |
Set done=True, reward=0, return termination observation |
| ANSWER action |
Not implemented |
Compare answer to gold, set done=True, compute reward |
| Empty/null action_description |
Passed to Ollama |
Validate and return error |
| reset() called mid-episode |
Clears history only |
Close SQLite connection, pick new question, open new connection |
| DESCRIBE "all" |
Not handled distinctly |
Return all table names (per v1 spec field description) |
Invariants to Preserve
4b. Code Shape & Design Target
Existing Vocabulary
| Concept |
Existing Name |
Location |
| Agent action |
SQLAction (action_type, action_description, tokens) |
models.py |
| Environment response |
SQLObservation (messages, tokens, done, reward) |
models.py |
| Episode metadata |
SQLState (history_messages, history_tokens, current_action_type) |
models.py |
| Table introspection |
_get_table_schema(table_name) |
server/sql_environment.py |
| Type conversion |
_sqlalchemy_type_to_natural_language(col_type) |
server/sql_environment.py |
| Sample query generation |
_generate_sample_query(table_name, limit) |
server/sql_environment.py |
| Core environment |
SQLEnvironment(Environment[SQLAction, SQLObservation, SQLState]) |
server/sql_environment.py |
| Per-episode state (conceptual) |
EpisodeContext (commented design outline) |
models.py lines 130-247 |
| Question record (conceptual) |
QuestionRecord |
models.py lines 224-236 |
| Answer verification (stub) |
server/verifier.py |
Placeholder only |
| Reward computation (stub) |
server/reward.py |
Placeholder only |
Language/Framework Idioms
- Pydantic models for all wire types (Action, Observation, State) -- follows OpenEnv pattern
- SQLAlchemy ORM for schema definition, but NOT for query execution (agents run raw SQL)
- FastAPI via OpenEnv's
create_app() factory pattern
- TypedDict for Message (
{role: str, content: str})
- Private methods prefixed with
_ for internal helpers
- Logging via
logging.getLogger(__name__)
- Type annotations throughout, including generics on Environment base class
- torch.Tensor for token storage (inherited from OpenEnv pattern)
- dataclass-style Pydantic with Field() for all model fields
Target Shape
| Component |
Purpose |
Why This Boundary |
EpisodeContext (dataclass or Pydantic) |
Per-episode server state: db_connection, question, step_count, budget, described_tables, action_log |
Already designed in models.py comments; isolates episode state from environment singleton |
_execute_sql(sql, timeout) |
Sandboxed SQL execution: SELECT-only check, read-only connection, timeout, truncation |
Single responsibility; reused by both QUERY and internal gold-answer computation |
_handle_describe(table_name) |
Return schema for a specific table |
Already exists as _get_table_schema(), just needs to use agent-provided table name directly instead of Ollama |
_handle_sample(table_name) |
Execute SELECT * FROM table LIMIT N via _execute_sql() |
Already exists as _generate_sample_query(), needs to actually execute the SQL |
_handle_query(sql_string) |
Validate and execute agent-provided SQL |
New; wraps _execute_sql() with SELECT-only validation |
_handle_answer(value) |
Compare to gold answer, set done=True, compute terminal reward |
New; minimal for MVP (delegate to verifier.py later) |
_load_questions(path) |
Load and parse question JSON |
Simple loader; called once at init or lazily |
_open_db(db_path) |
Open read-only SQLite connection with timeout |
Called at reset(); isolated for testability |
_build_observation() |
Construct SQLObservation from episode context |
Replace current _create_observation() which only handles messages/tokens |
Abstraction Level
- Current level: Mostly flat -- single class with private helper methods. No service layer, no repository pattern. The conceptual
EpisodeContext is outlined but not implemented.
- Recommendation: Stay flat. Add
EpisodeContext as a dataclass to hold per-episode state (replacing the current self._state which mixes episode data with token history). Keep all action handlers as private methods on SQLEnvironment. Do not introduce a separate service class or handler classes.
Anti-Patterns to Avoid
- Do not create separate handler classes (e.g., DescribeHandler, QueryHandler) -- this codebase uses private methods on the environment class
- Do not over-abstract the SQL execution -- a single
_execute_sql() method is sufficient; no need for a query builder or execution strategy pattern
- Do not keep the
message_to_action() + _detect_action_type() pipeline for the new structured flow -- these are artifacts of the Ollama-based NL interpretation. The agent now sends structured actions directly. However, if message_to_action() is part of the OpenEnv contract, it may need to be preserved as a thin adapter
- Do not add reward computation in F001 --
server/reward.py is explicitly a Phase 3 stub. For F001, use simple terminal reward (1.0 for correct ANSWER, 0.0 otherwise)
- Avoid a single 200-line step() method -- dispatch to
_handle_describe(), _handle_sample(), _handle_query(), _handle_answer() and keep each under 30 lines
5. Constraints
Technical Constraints
| Constraint |
Requirement |
Notes |
| Query execution latency |
< 100ms |
SQLite on local disk with small Spider databases; should be trivial |
| Query timeout |
5 seconds max |
Use sqlite3.Connection.set_progress_handler() or execute in a thread with timeout |
| Read-only access |
No writes to database |
Open SQLite with ?mode=ro URI or use PRAGMA query_only = ON |
| SELECT-only queries |
Block INSERT/UPDATE/DELETE/DROP/CREATE/ALTER |
Parse SQL prefix before execution |
| Output truncation |
Max 20 rows |
Truncate result set, add "... (N more rows)" indicator |
| Step budget |
15 steps per episode (configurable) |
Decrement on non-ANSWER actions |
Pattern Constraints
- Must implement
reset(seed, episode_id, **kwargs) -> SQLObservation matching OpenEnv base signature
- Must implement
step(action, timeout_s, **kwargs) -> SQLObservation matching OpenEnv base signature
- Must maintain
state property returning SQLState
- Pydantic models must remain serializable over HTTP/WebSocket (no raw sqlite3 objects in observations)
- Constructor must remain compatible with
create_sql_environment() factory in app.py
Testing Constraints
| Test Suite |
Coverage Area |
Notes |
tests/test_smoke.py |
Models, environment reset/step, action detection, message_to_action, client serialization, schema introspection |
6 test classes, ~20 tests. Several will break: step tests assume Ollama-based behavior; reset tests assume no question loading. TestActionDetection and TestMessageToAction test NL keyword detection which may be deprecated |
6. Open Questions
| Question |
Why It Matters |
Who Can Answer |
| Where do the SQLite database files come from? |
No .sqlite files exist. ORM models define schema but Spider databases are separate artifacts. download_spider_data.py only downloads question JSON, not databases |
Developer decision: generate from ORM + seed, or download Spider DBs |
Should action_description carry the structured argument, or should we add a dedicated argument field to SQLAction? |
v1 spec defines argument: str as a separate field. Current code uses action_description for NL text. Using action_description avoids a model change but is semantically misleading |
Developer (API design decision) |
What happens to message_to_action() and _detect_action_type()? |
These convert NL messages to actions using keyword matching. With structured actions, the agent sends action_type directly. But message_to_action might be part of the OpenEnv contract |
Developer + OpenEnv docs |
| Should SQLObservation fields be uncommented (question, schema_info, result, error, etc.) or should we continue using messages-only? |
The v1 spec and the commented-out fields describe a rich observation. The current implementation uses only messages + tokens. Rich fields make the API cleaner for RL agents |
Developer (observation design decision) |
7. Context Sources
| Source |
Type |
Notes |
server/sql_environment.py |
Code |
Main environment: 546 lines, Ollama-dependent step(), no SQL execution |
models.py |
Code |
Wire types + conceptual EpisodeContext design (commented) |
server/app.py |
Code |
FastAPI factory, tokenizer setup |
data/databases/models.py |
Code |
9 SQLAlchemy ORM tables (student_assessment schema) |
data/questions/student_assessment.json |
Data |
53 Spider questions with gold SQL, db_id, tokenized queries |
docs_draft/SQLEnv_Concept_v1.md |
Spec |
V1 design: action space, episode lifecycle, reward architecture, anti-gaming |
server/verifier.py |
Code |
Stub -- placeholder for Phase 2 answer verification |
server/reward.py |
Code |
Stub -- placeholder for Phase 3 reward computation |
scripts/download_spider_data.py |
Code |
Downloads question JSON from HuggingFace Spider dataset (not databases) |
tests/test_smoke.py |
Code |
6 test classes, ~20 tests covering models, env, action detection, client |
server/test_sql_env.py |
Code |
MockTokenizer for testing without transformers |
client.py |
Code |
SQLEnvClient with _step_payload() and _parse_result() |
OpenEnv interfaces.py |
Code (vendored) |
Environment base class: reset(seed, episode_id), step(action, timeout_s), state property |
OpenEnv types.py |
Code (vendored) |
Action, Observation, State Pydantic base models |
Human Validation Checkpoint
Before proceeding to planning, please confirm:
Questions for reviewer:
- Is anything incorrect or missing?
- Are there risks I haven't identified?
- Should we proceed to planning?
Validated by: [NAME] on [DATE]