
Research Summary

Project: sql-env
Change: F001 - Core Environment Loop (step/reset lifecycle with structured actions, SQLite execution, sandboxing, question loading, step budget)
Date: 2026-03-24
Status: Draft


1. Change Overview

What We're Changing

Complete the step/reset lifecycle so the environment actually executes SQL. Currently step() delegates to Ollama for action interpretation and SQL generation -- the environment never touches SQLite. This feature removes the Ollama dependency from step(), accepts structured actions (DESCRIBE table_name, SAMPLE table_name, QUERY sql_string, ANSWER value), wires up SQLite execution with sandboxing (read-only, 5s timeout, SELECT-only), loads questions from JSON on reset(), enforces a 15-step budget, and handles episode termination.
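As an illustration, the four structured actions might serialize to payloads like the following. This is a hypothetical shape: whether the argument lives in a dedicated argument field or reuses action_description is still an open question (see Section 6).

```python
# Hypothetical wire payloads for the four structured actions. The
# "argument" field name follows the v1 spec sketch and is not final.
actions = [
    {"action_type": "describe", "argument": "employees"},
    {"action_type": "sample",   "argument": "employees"},
    {"action_type": "query",    "argument": "SELECT name FROM employees LIMIT 5"},
    {"action_type": "answer",   "argument": "42"},
]
```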

Why We're Changing It

The environment is architecturally broken for RL use. In an RL environment, the AGENT generates actions and the ENVIRONMENT executes them deterministically. Currently the environment calls Ollama inside step() to (a) select which table to DESCRIBE and (b) generate SQL for QUERY actions. This makes the environment non-deterministic and couples it to an external LLM service. The v1 spec defines structured actions where the agent provides table names and SQL directly.

Success Criteria

  • Agent sends DESCRIBE employees and immediately sees column names and types
  • Queries execute in <100ms with clean truncated output (max 20 rows)
  • Bad SQL returns a clear error message the agent can learn from
  • Episode ends cleanly when budget exhausted or ANSWER submitted
  • No Ollama dependency in the environment's step/reset path

2. System Context

Current Behavior

SQLEnvironment inherits from OpenEnv Environment[SQLAction, SQLObservation, SQLState]. On reset(), it clears history_messages/history_tokens and re-adds the system prompt. On step(), it dispatches on action.action_type:

  • describe -> calls _call_ollama_to_select_table() then _get_table_schema()
  • query -> calls _call_ollama_for_sql() (generates SQL but never executes it)
  • sample -> calls _call_ollama_to_select_table() then _generate_sample_query() (generates SQL string but never executes it)

No ANSWER action exists. No questions are loaded. No SQLite database connection exists. No step budget is tracked. The SQLObservation currently only carries messages and tokens (the richer fields like question, schema_info, result, error, step_count, budget_remaining, action_history are commented out in models.py).

Architecture Context

Agent (external) --WebSocket/HTTP--> FastAPI (server/app.py)
                                       |
                                       v
                              SQLEnvironment (server/sql_environment.py)
                                       |
                                 [MISSING: SQLite connection]
                                 [MISSING: Question loading]
                                 [MISSING: Episode context]
                                       |
                              models.py (SQLAction, SQLObservation, SQLState)
                              data/questions/student_assessment.json (53 Q&A pairs)
                              data/databases/models.py (9 ORM tables, no .sqlite file)

Entry Points

| Entry Point | Trigger | Current Flow |
| --- | --- | --- |
| POST /reset (via OpenEnv create_app) | Client calls reset | SQLEnvironment.reset() -> clears history, returns observation with system prompt |
| POST /step (via OpenEnv create_app) | Client sends action | SQLEnvironment.step(action) -> dispatches on action_type, calls Ollama, appends messages |
| GET /state | Client queries state | Returns SQLState (history_messages, history_tokens, current_action_type) |
| WebSocket /ws | Persistent connection | Same reset/step but over WS |

Data Flow

| Data | Source | Shape/Type | Destination |
| --- | --- | --- | --- |
| SQLAction | Agent via HTTP/WS | {action_type: str, action_description: str, tokens: Tensor} | SQLEnvironment.step() |
| SQLObservation | Environment | {messages: list[Message], tokens: Tensor, done: bool, reward: float} | Agent via HTTP/WS |
| Questions JSON | data/questions/student_assessment.json | [{db_id, query, question, query_toks, query_toks_no_value, ...}] | Loaded at reset() to pick an episode question |
| SQLite database | data/databases/ (NOT YET PRESENT) | .sqlite file | Read-only connection per episode |
| ORM models | data/databases/models.py | 9 SQLAlchemy classes | Used by _get_table_schema() for column introspection |

3. Dependencies

Code We Depend On

| Dependency | What We Use | Risk if Changed |
| --- | --- | --- |
| openenv.core.env_server.interfaces.Environment | Base class: `reset(seed, episode_id, **kwargs) -> ObsT`, `step(action, timeout_s, **kwargs) -> ObsT`, `state` property | Signature changes break our overrides |
| openenv.core.env_server.types.Action, Observation, State | Pydantic base models for SQLAction, SQLObservation, SQLState | Field additions could conflict |
| openenv.core.env_server.create_app | FastAPI app factory that wires endpoints to our environment | N/A (stable) |
| openenv.core.env_server.interfaces.ModelTokenizer | Protocol for tokenizer (apply_chat_template, decode) | Only used for token history -- not needed for F001 core logic |
| SQLAlchemy ORM models (data/databases/models.py) | 9 model classes for table introspection via `__table__.columns` | Column/table name drift breaks schema descriptions |
| sqlite3 (stdlib) | Will be used for query execution | Stable |
| torch | Tensor operations for token history | Current coupling is heavy but out of F001 scope to change |

Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
| --- | --- | --- |
| server/app.py | Creates SQLEnvironment via factory, passes to create_app() | Constructor signature change if we add params (e.g., questions_path, db_path) |
| client.py (SQLEnvClient) | _step_payload() serializes SQLAction, _parse_result() deserializes SQLObservation | If SQLObservation fields change (uncomment rich fields), client must be updated |
| tests/test_smoke.py | Tests reset(), step(), message_to_action(), schema introspection | Tests will need updating for new behavior (step now executes SQL, reset now loads question) |

External Systems

| System | Integration Point | Considerations |
| --- | --- | --- |
| SQLite database files | data/databases/*.sqlite | Files do NOT exist yet. ORM models define the schema but no .sqlite file is present. Need to either (a) generate from ORM models via SQLAlchemy create_all() plus seed data, or (b) download Spider database files |
| Spider dataset (HuggingFace) | scripts/download_spider_data.py | Downloads question JSON only, NOT the actual SQLite database files. student_assessment.json references db_id: "student_assessment" but no corresponding .sqlite exists |
| Ollama (BEING REMOVED) | _call_ollama_to_select_table(), _call_ollama_for_sql() | These will be deleted. No fallback needed -- the agent provides structured actions directly |
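Option (a), generating the missing database from the ORM models, could look roughly like the sketch below. It assumes SQLAlchemy 1.4+; Student is a stand-in model, not one of the real nine classes in data/databases/models.py, and the output filename is illustrative.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Student(Base):
    """Stand-in for the real ORM models in data/databases/models.py."""
    __tablename__ = "students"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)

# Materialize a .sqlite file with empty tables from the ORM schema...
engine = create_engine("sqlite:///student_assessment.sqlite")
Base.metadata.create_all(engine)

# ...then seed minimal data so SAMPLE/QUERY return meaningful rows.
with Session(engine) as session:
    session.add(Student(id=1, name="Ada"))
    session.commit()
```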

4. Risks & Edge Cases

Identified Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| No .sqlite database file exists | Certain | Environment cannot execute SQL at all | Must create/download database before this feature works. SQLAlchemy Base.metadata.create_all(engine) can create empty tables, but data seeding is needed for meaningful SAMPLE/QUERY results |
| Question JSON has no gold_answer field | High | Cannot implement ANSWER verification | Spider format has query (gold SQL) but not a pre-computed gold answer. Must either run gold SQL at reset() to compute the answer, or defer ANSWER verification to Phase 2 (verifier.py) |
| SQLObservation field changes break client | Medium | Client deserialization fails | Update SQLEnvClient._parse_result() alongside observation changes |
| Existing tests assume Ollama-based step behavior | Medium | Tests break | Rewrite tests to use structured actions |
| SQL injection via QUERY action | Medium | Agent could run destructive SQL | Enforce SELECT-only parsing + read-only SQLite connection + 5s timeout |
| message_to_action() pipeline conflicts with structured actions | Medium | Two action-creation paths (NL keyword detection vs. structured) | The message_to_action() + _detect_action_type() pipeline is designed for the Ollama-based flow. With structured actions, the agent sends action_type directly. Need to decide whether to keep or remove message_to_action() |

Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
| --- | --- | --- |
| DESCRIBE with invalid table name | Ollama guesses a table | Return error: "Table 'xyz' not found. Available tables: ..." |
| QUERY with non-SELECT SQL (INSERT, DROP, etc.) | Never executed | Reject with clear error before execution |
| QUERY that times out (>5s) | Never executed | Kill query, return timeout error |
| QUERY returning >20 rows | Never executed | Truncate to 20 rows, indicate truncation |
| Budget exhausted (15 steps) without ANSWER | No budget tracking | Set done=True, reward=0, return termination observation |
| ANSWER action | Not implemented | Compare answer to gold, set done=True, compute reward |
| Empty/null action_description | Passed to Ollama | Validate and return error |
| reset() called mid-episode | Clears history only | Close SQLite connection, pick new question, open new connection |
| DESCRIBE "all" | Not handled distinctly | Return all table names (per v1 spec field description) |

Invariants to Preserve

  • OpenEnv Environment interface contract: reset() returns ObsT, step() returns ObsT, state property returns StateT
  • Observation.done=True only on episode termination (ANSWER submitted or budget exhausted)
  • SQLite connection is always read-only (no writes possible)
  • Step budget decrements only on non-ANSWER actions
  • Agent never sees gold_sql or gold_answer in observations

4b. Code Shape & Design Target

Existing Vocabulary

| Concept | Existing Name | Location |
| --- | --- | --- |
| Agent action | SQLAction (action_type, action_description, tokens) | models.py |
| Environment response | SQLObservation (messages, tokens, done, reward) | models.py |
| Episode metadata | SQLState (history_messages, history_tokens, current_action_type) | models.py |
| Table introspection | _get_table_schema(table_name) | server/sql_environment.py |
| Type conversion | _sqlalchemy_type_to_natural_language(col_type) | server/sql_environment.py |
| Sample query generation | _generate_sample_query(table_name, limit) | server/sql_environment.py |
| Core environment | SQLEnvironment(Environment[SQLAction, SQLObservation, SQLState]) | server/sql_environment.py |
| Per-episode state (conceptual) | EpisodeContext (commented design outline) | models.py lines 130-247 |
| Question record (conceptual) | QuestionRecord | models.py lines 224-236 |
| Answer verification (stub) | server/verifier.py | Placeholder only |
| Reward computation (stub) | server/reward.py | Placeholder only |

Language/Framework Idioms

  • Pydantic models for all wire types (Action, Observation, State) -- follows OpenEnv pattern
  • SQLAlchemy ORM for schema definition, but NOT for query execution (agents run raw SQL)
  • FastAPI via OpenEnv's create_app() factory pattern
  • TypedDict for Message ({role: str, content: str})
  • Private methods prefixed with _ for internal helpers
  • Logging via logging.getLogger(__name__)
  • Type annotations throughout, including generics on Environment base class
  • torch.Tensor for token storage (inherited from OpenEnv pattern)
  • dataclass-style Pydantic with Field() for all model fields

Target Shape

| Component | Purpose | Why This Boundary |
| --- | --- | --- |
| EpisodeContext (dataclass or Pydantic) | Per-episode server state: db_connection, question, step_count, budget, described_tables, action_log | Already designed in models.py comments; isolates episode state from environment singleton |
| _execute_sql(sql, timeout) | Sandboxed SQL execution: SELECT-only check, read-only connection, timeout, truncation | Single responsibility; reused by both QUERY and internal gold-answer computation |
| _handle_describe(table_name) | Return schema for a specific table | Already exists as _get_table_schema(); just needs to use the agent-provided table name directly instead of Ollama |
| _handle_sample(table_name) | Execute SELECT * FROM table LIMIT N via _execute_sql() | Already exists as _generate_sample_query(); needs to actually execute the SQL |
| _handle_query(sql_string) | Validate and execute agent-provided SQL | New; wraps _execute_sql() with SELECT-only validation |
| _handle_answer(value) | Compare to gold answer, set done=True, compute terminal reward | New; minimal for MVP (delegate to verifier.py later) |
| _load_questions(path) | Load and parse question JSON | Simple loader; called once at init or lazily |
| _open_db(db_path) | Open read-only SQLite connection with timeout | Called at reset(); isolated for testability |
| _build_observation() | Construct SQLObservation from episode context | Replaces the current _create_observation(), which only handles messages/tokens |

Abstraction Level

  • Current level: Mostly flat -- single class with private helper methods. No service layer, no repository pattern. The conceptual EpisodeContext is outlined but not implemented.
  • Recommendation: Stay flat. Add EpisodeContext as a dataclass to hold per-episode state (replacing the current self._state which mixes episode data with token history). Keep all action handlers as private methods on SQLEnvironment. Do not introduce a separate service class or handler classes.
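A minimal EpisodeContext along those lines might look like this. Field names follow the commented outline in models.py but are assumptions, not the final API:

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class EpisodeContext:
    """Per-episode server state; replaces the episode half of self._state."""
    question: dict                     # the Spider record picked at reset()
    db_connection: sqlite3.Connection  # read-only, opened at reset()
    budget: int = 15
    step_count: int = 0
    described_tables: set = field(default_factory=set)
    action_log: list = field(default_factory=list)

    @property
    def budget_remaining(self) -> int:
        return self.budget - self.step_count
```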

Anti-Patterns to Avoid

  • Do not create separate handler classes (e.g., DescribeHandler, QueryHandler) -- this codebase uses private methods on the environment class
  • Do not over-abstract the SQL execution -- a single _execute_sql() method is sufficient; no need for a query builder or execution strategy pattern
  • Do not keep the message_to_action() + _detect_action_type() pipeline for the new structured flow -- these are artifacts of the Ollama-based NL interpretation. The agent now sends structured actions directly. However, if message_to_action() is part of the OpenEnv contract, it may need to be preserved as a thin adapter
  • Do not add reward computation in F001 -- server/reward.py is explicitly a Phase 3 stub. For F001, use simple terminal reward (1.0 for correct ANSWER, 0.0 otherwise)
  • Avoid a single 200-line step() method -- dispatch to _handle_describe(), _handle_sample(), _handle_query(), _handle_answer() and keep each under 30 lines
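The recommended dispatch shape, sketched with placeholder handler bodies. Everything here is illustrative: the real handlers would build SQLObservation instances rather than plain dicts, and the class name is hypothetical.

```python
class SQLEnvironmentSketch:
    """Skeleton of the step() dispatch; handler bodies are placeholders."""

    def __init__(self, budget: int = 15):
        self.budget = budget
        self.step_count = 0

    def step(self, action_type: str, argument: str) -> dict:
        handlers = {
            "describe": self._handle_describe,
            "sample": self._handle_sample,
            "query": self._handle_query,
            "answer": self._handle_answer,
        }
        handler = handlers.get(action_type)
        if handler is None:
            return {"error": f"Unknown action_type {action_type!r}", "done": False}
        if action_type != "answer":
            self.step_count += 1  # budget decrements only on non-ANSWER actions
        obs = handler(argument)
        if self.step_count >= self.budget and not obs.get("done"):
            obs = {"error": "Step budget exhausted", "done": True, "reward": 0.0}
        return obs

    def _handle_describe(self, table: str) -> dict:
        return {"result": f"schema of {table}", "done": False}

    def _handle_sample(self, table: str) -> dict:
        return {"result": f"sample rows from {table}", "done": False}

    def _handle_query(self, sql: str) -> dict:
        return {"result": f"executed: {sql}", "done": False}

    def _handle_answer(self, value: str) -> dict:
        return {"result": value, "done": True, "reward": 0.0}
```

Each handler stays well under 30 lines, and step() itself does nothing but validate, decrement the budget, and dispatch.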

5. Constraints

Technical Constraints

| Constraint | Requirement | Notes |
| --- | --- | --- |
| Query execution latency | < 100ms | SQLite on local disk with small Spider databases; should be trivial |
| Query timeout | 5 seconds max | Use sqlite3.Connection.set_progress_handler() or execute in a thread with timeout |
| Read-only access | No writes to database | Open SQLite with ?mode=ro URI or use PRAGMA query_only = ON |
| SELECT-only queries | Block INSERT/UPDATE/DELETE/DROP/CREATE/ALTER | Parse SQL prefix before execution |
| Output truncation | Max 20 rows | Truncate result set, add "... (N more rows)" indicator |
| Step budget | 15 steps per episode (configurable) | Decrement on non-ANSWER actions |
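The read-only URI, progress-handler timeout, and 20-row truncation constraints could combine into a single _execute_sql()-style helper. A sketch only; the result-dict shape and names are assumptions:

```python
import sqlite3
import time

def execute_sql(db_path: str, sql: str, timeout_s: float = 5.0,
                max_rows: int = 20) -> dict:
    """Sandboxed executor sketch: read-only URI connection, progress-handler
    timeout, and row truncation. The result-dict shape is an assumption."""
    try:
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    except sqlite3.OperationalError as exc:
        return {"columns": [], "rows": [], "truncated": False, "error": str(exc)}
    deadline = time.monotonic() + timeout_s
    # The handler fires every 1000 SQLite VM ops; a truthy return aborts.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        cursor = conn.execute(sql)
        rows = cursor.fetchmany(max_rows + 1)  # fetch one extra to detect overflow
        return {
            "columns": [col[0] for col in cursor.description or []],
            "rows": rows[:max_rows],
            "truncated": len(rows) > max_rows,
            "error": None,
        }
    except sqlite3.OperationalError as exc:
        return {"columns": [], "rows": [], "truncated": False, "error": str(exc)}
    finally:
        conn.close()
```

Note that mode=ro makes writes fail at execute time with an OperationalError, so even SQL that slips past the SELECT-only check cannot mutate the database.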

Pattern Constraints

  • Must implement reset(seed, episode_id, **kwargs) -> SQLObservation matching OpenEnv base signature
  • Must implement step(action, timeout_s, **kwargs) -> SQLObservation matching OpenEnv base signature
  • Must maintain state property returning SQLState
  • Pydantic models must remain serializable over HTTP/WebSocket (no raw sqlite3 objects in observations)
  • Constructor must remain compatible with create_sql_environment() factory in app.py

Testing Constraints

| Test Suite | Coverage Area | Notes |
| --- | --- | --- |
| tests/test_smoke.py | Models, environment reset/step, action detection, message_to_action, client serialization, schema introspection | 6 test classes, ~20 tests. Several will break: step tests assume Ollama-based behavior; reset tests assume no question loading. TestActionDetection and TestMessageToAction test NL keyword detection, which may be deprecated |

6. Open Questions

| Question | Why It Matters | Who Can Answer |
| --- | --- | --- |
| Where do the SQLite database files come from? | No .sqlite files exist. ORM models define the schema, but Spider databases are separate artifacts; download_spider_data.py only downloads question JSON, not databases | Developer decision: generate from ORM + seed, or download Spider DBs |
| Should action_description carry the structured argument, or should we add a dedicated argument field to SQLAction? | The v1 spec defines argument: str as a separate field. Current code uses action_description for NL text. Reusing action_description avoids a model change but is semantically misleading | Developer (API design decision) |
| What happens to message_to_action() and _detect_action_type()? | These convert NL messages to actions using keyword matching. With structured actions, the agent sends action_type directly -- but message_to_action might be part of the OpenEnv contract | Developer + OpenEnv docs |
| Should SQLObservation fields be uncommented (question, schema_info, result, error, etc.), or should we continue using messages-only? | The v1 spec and the commented-out fields describe a rich observation; the current implementation uses only messages + tokens. Rich fields make the API cleaner for RL agents | Developer (observation design decision) |

7. Context Sources

| Source | Type | Notes |
| --- | --- | --- |
| server/sql_environment.py | Code | Main environment: 546 lines, Ollama-dependent step(), no SQL execution |
| models.py | Code | Wire types + conceptual EpisodeContext design (commented) |
| server/app.py | Code | FastAPI factory, tokenizer setup |
| data/databases/models.py | Code | 9 SQLAlchemy ORM tables (student_assessment schema) |
| data/questions/student_assessment.json | Data | 53 Spider questions with gold SQL, db_id, tokenized queries |
| docs_draft/SQLEnv_Concept_v1.md | Spec | V1 design: action space, episode lifecycle, reward architecture, anti-gaming |
| server/verifier.py | Code | Stub -- placeholder for Phase 2 answer verification |
| server/reward.py | Code | Stub -- placeholder for Phase 3 reward computation |
| scripts/download_spider_data.py | Code | Downloads question JSON from HuggingFace Spider dataset (not databases) |
| tests/test_smoke.py | Code | 6 test classes, ~20 tests covering models, env, action detection, client |
| server/test_sql_env.py | Code | MockTokenizer for testing without transformers |
| client.py | Code | SQLEnvClient with _step_payload() and _parse_result() |
| OpenEnv interfaces.py | Code (vendored) | Environment base class: reset(seed, episode_id), step(action, timeout_s), state property |
| OpenEnv types.py | Code (vendored) | Action, Observation, State Pydantic base models |

Human Validation Checkpoint

Before proceeding to planning, please confirm:

  • System context is accurate
  • Dependencies are complete
  • Risks are identified
  • Constraints are correct
  • Open questions can be resolved

Questions for reviewer:

  1. Is anything incorrect or missing?
  2. Are there risks I haven't identified?
  3. Should we proceed to planning?

Validated by: [NAME] on [DATE]