# Research Summary

**Project:** sql-env
**Change:** F001 - Core Environment Loop (step/reset lifecycle with structured actions, SQLite execution, sandboxing, question loading, step budget)
**Date:** 2026-03-24
**Status:** Draft

---

## 1. Change Overview

### What We're Changing

Complete the step/reset lifecycle so the environment actually executes SQL. Currently `step()` delegates to Ollama for action interpretation and SQL generation -- the environment never touches SQLite. This feature removes the Ollama dependency from `step()`, accepts structured actions (DESCRIBE table_name, SAMPLE table_name, QUERY sql_string, ANSWER value), wires up SQLite execution with sandboxing (read-only, 5s timeout, SELECT-only), loads questions from JSON on `reset()`, enforces a 15-step budget, and handles episode termination.

### Why We're Changing It

The environment is architecturally broken for RL use. In an RL environment, the AGENT generates actions and the ENVIRONMENT executes them deterministically. Currently the environment calls Ollama inside `step()` to (a) select which table to DESCRIBE and (b) generate SQL for QUERY actions. This makes the environment non-deterministic and couples it to an external LLM service. The v1 spec defines structured actions where the agent provides table names and SQL directly.

### Success Criteria

- Agent sends `DESCRIBE employees` and immediately sees column names and types
- Queries execute in <100ms with clean truncated output (max 20 rows)
- Bad SQL returns a clear error message the agent can learn from
- Episode ends cleanly when the budget is exhausted or an ANSWER is submitted
- No Ollama dependency in the environment's step/reset path

---

## 2. System Context

### Current Behavior

`SQLEnvironment` inherits from OpenEnv `Environment[SQLAction, SQLObservation, SQLState]`. On `reset()`, it clears history_messages/history_tokens and re-adds the system prompt.
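For concreteness, the structured actions introduced in §1 would reach `step()` as payloads shaped roughly like the sketch below. Field names follow the current `SQLAction` model; whether the argument belongs in `action_description` or a dedicated `argument` field is one of the open questions in §6, so these shapes are illustrative, not the final API.

```python
# Illustrative payloads for the four structured actions (DESCRIBE, SAMPLE,
# QUERY, ANSWER). The argument-carrying field is an open design question,
# so `action_description` here is an assumption, not the settled API.
STRUCTURED_ACTIONS = [
    {"action_type": "describe", "action_description": "employees"},
    {"action_type": "sample",   "action_description": "employees"},
    {"action_type": "query",
     "action_description": "SELECT COUNT(*) FROM employees"},
    {"action_type": "answer",   "action_description": "42"},
]

def is_valid_action(payload: dict) -> bool:
    """Minimal validation sketch: known action_type, non-empty argument.

    Covers the 'empty/null action_description' edge case listed in §4.
    """
    known = {"describe", "sample", "query", "answer"}
    return (payload.get("action_type") in known
            and bool(payload.get("action_description", "").strip()))
```

With this shape, the environment never needs Ollama to interpret the agent's intent: the action type and its argument arrive pre-structured.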
On `step()`, it dispatches on `action.action_type`:

- `describe` -> calls `_call_ollama_to_select_table()` then `_get_table_schema()`
- `query` -> calls `_call_ollama_for_sql()` (generates SQL but never executes it)
- `sample` -> calls `_call_ollama_to_select_table()` then `_generate_sample_query()` (generates a SQL string but never executes it)

No ANSWER action exists. No questions are loaded. No SQLite database connection exists. No step budget is tracked. The `SQLObservation` currently only carries `messages` and `tokens` (the richer fields like `question`, `schema_info`, `result`, `error`, `step_count`, `budget_remaining`, `action_history` are commented out in models.py).

### Architecture Context

```
Agent (external) --WebSocket/HTTP--> FastAPI (server/app.py)
                                          |
                                          v
                      SQLEnvironment (server/sql_environment.py)
                                          |
                          [MISSING: SQLite connection]
                          [MISSING: Question loading]
                          [MISSING: Episode context]
                                          |
                      models.py (SQLAction, SQLObservation, SQLState)

data/questions/student_assessment.json (53 Q&A pairs)
data/databases/models.py (9 ORM tables, no .sqlite file)
```

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `POST /reset` (via OpenEnv create_app) | Client calls reset | `SQLEnvironment.reset()` -> clears history, returns observation with system prompt |
| `POST /step` (via OpenEnv create_app) | Client sends action | `SQLEnvironment.step(action)` -> dispatches on action_type, calls Ollama, appends messages |
| `GET /state` | Client queries state | Returns `SQLState` (history_messages, history_tokens, current_action_type) |
| WebSocket `/ws` | Persistent connection | Same reset/step but over WS |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| SQLAction | Agent via HTTP/WS | `{action_type: str, action_description: str, tokens: Tensor}` | `SQLEnvironment.step()` |
| SQLObservation | Environment | `{messages: list[Message], tokens: Tensor, done: bool, reward: float}` | Agent via HTTP/WS |
| Questions JSON | `data/questions/student_assessment.json` | `[{db_id, query, question, query_toks, query_toks_no_value, ...}]` | Loaded at reset() to pick an episode question |
| SQLite database | `data/databases/` (NOT YET PRESENT) | `.sqlite` file | Read-only connection per episode |
| ORM models | `data/databases/models.py` | 9 SQLAlchemy classes | Used by `_get_table_schema()` for column introspection |

---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `openenv.core.env_server.interfaces.Environment` | Base class: `reset(seed, episode_id, **kwargs) -> ObsT`, `step(action, timeout_s, **kwargs) -> ObsT`, `state` property | Signature changes break our overrides |
| `openenv.core.env_server.types.Action, Observation, State` | Pydantic base models for SQLAction, SQLObservation, SQLState | Field additions could conflict |
| `openenv.core.env_server.create_app` | FastAPI app factory that wires endpoints to our environment | N/A (stable) |
| `openenv.core.env_server.interfaces.ModelTokenizer` | Protocol for tokenizer (apply_chat_template, decode) | Only used for token history -- not needed for F001 core logic |
| SQLAlchemy ORM models (`data/databases/models.py`) | 9 model classes for table introspection via `__table__.columns` | Column/table name drift breaks schema descriptions |
| `sqlite3` (stdlib) | Will be used for query execution | Stable |
| `torch` | Tensor operations for token history | Current coupling is heavy but out of F001 scope to change |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|---------------------|
| `server/app.py` | Creates `SQLEnvironment` via factory, passes to `create_app()` | Constructor signature change if we add params (e.g., questions_path, db_path) |
| `client.py` (`SQLEnvClient`) | `_step_payload()` serializes SQLAction, `_parse_result()` deserializes SQLObservation | If SQLObservation fields change (uncomment rich fields), client must be updated |
| `tests/test_smoke.py` | Tests reset(), step(), message_to_action(), schema introspection | Tests will need updating for new behavior (step now executes SQL, reset now loads a question) |

### External Systems

| System | Integration Point | Considerations |
|--------|-------------------|----------------|
| SQLite database files | `data/databases/*.sqlite` | Files do NOT exist yet. ORM models define the schema but no .sqlite file is present. Need to either (a) generate from ORM models via SQLAlchemy `create_all()` + seed data, or (b) download Spider database files |
| Spider dataset (HuggingFace) | `scripts/download_spider_data.py` | Downloads question JSON only. Does NOT download the actual SQLite database files. The student_assessment.json references `db_id: "student_assessment"` but no corresponding .sqlite exists |
| Ollama (BEING REMOVED) | `_call_ollama_to_select_table()`, `_call_ollama_for_sql()` | These will be deleted. No fallback needed -- the agent provides structured actions directly |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| No .sqlite database file exists | Certain | Environment cannot execute SQL at all | Must create/download a database before this feature works. SQLAlchemy `Base.metadata.create_all(engine)` can create empty tables, but data seeding is needed for meaningful SAMPLE/QUERY results |
| Question JSON has no `gold_answer` field | High | Cannot implement ANSWER verification | Spider format has `query` (gold SQL) but not a pre-computed gold answer. Must either run the gold SQL at reset() to compute the answer, or defer ANSWER verification to Phase 2 (verifier.py) |
| SQLObservation field changes break client | Medium | Client deserialization fails | Update `SQLEnvClient._parse_result()` alongside observation changes |
| Existing tests assume Ollama-based step behavior | Medium | Tests break | Rewrite tests to use structured actions |
| SQL injection via QUERY action | Medium | Agent could run destructive SQL | Enforce SELECT-only parsing + read-only SQLite connection + 5s timeout |
| `message_to_action()` pipeline conflicts with structured actions | Medium | Two action creation paths (NL keyword detection vs. structured) | The `message_to_action()` + `_detect_action_type()` pipeline is designed for the Ollama-based flow. With structured actions, the agent sends `action_type` directly. Need to decide whether to keep or remove `message_to_action()` |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| DESCRIBE with invalid table name | Ollama guesses a table | Return error: "Table 'xyz' not found. Available tables: ..." |
| QUERY with non-SELECT SQL (INSERT, DROP, etc.) | Never executed | Reject with a clear error before execution |
| QUERY that times out (>5s) | Never executed | Kill the query, return a timeout error |
| QUERY returning >20 rows | Never executed | Truncate to 20 rows, indicate truncation |
| Budget exhausted (15 steps) without ANSWER | No budget tracking | Set done=True, reward=0, return a termination observation |
| ANSWER action | Not implemented | Compare answer to gold, set done=True, compute reward |
| Empty/null action_description | Passed to Ollama | Validate and return an error |
| reset() called mid-episode | Clears history only | Close the SQLite connection, pick a new question, open a new connection |
| DESCRIBE "all" | Not handled distinctly | Return all table names (per v1 spec field description) |

### Invariants to Preserve

- [ ] OpenEnv Environment interface contract: reset() returns ObsT, step() returns ObsT, state property returns StateT
- [ ] Observation.done=True only on episode termination (ANSWER submitted or budget exhausted)
- [ ] SQLite connection is always read-only (no writes possible)
- [ ] Step budget decrements only on non-ANSWER actions
- [ ] Agent never sees gold_sql or gold_answer in observations

---

## 4b. Code Shape & Design Target

### Existing Vocabulary

| Concept | Existing Name | Location |
|---------|---------------|----------|
| Agent action | `SQLAction` (action_type, action_description, tokens) | `models.py` |
| Environment response | `SQLObservation` (messages, tokens, done, reward) | `models.py` |
| Episode metadata | `SQLState` (history_messages, history_tokens, current_action_type) | `models.py` |
| Table introspection | `_get_table_schema(table_name)` | `server/sql_environment.py` |
| Type conversion | `_sqlalchemy_type_to_natural_language(col_type)` | `server/sql_environment.py` |
| Sample query generation | `_generate_sample_query(table_name, limit)` | `server/sql_environment.py` |
| Core environment | `SQLEnvironment(Environment[SQLAction, SQLObservation, SQLState])` | `server/sql_environment.py` |
| Per-episode state (conceptual) | `EpisodeContext` (commented design outline) | `models.py` lines 130-247 |
| Question record (conceptual) | `QuestionRecord` | `models.py` lines 224-236 |
| Answer verification (stub) | `server/verifier.py` | Placeholder only |
| Reward computation (stub) | `server/reward.py` | Placeholder only |

### Language/Framework Idioms

- **Pydantic models** for all wire types (Action, Observation, State) -- follows the OpenEnv pattern
- **SQLAlchemy ORM** for schema definition, but NOT for query execution (agents run raw SQL)
- **FastAPI** via OpenEnv's `create_app()` factory pattern
- **TypedDict** for Message (`{role: str, content: str}`)
- **Private methods** prefixed with `_` for internal helpers
- **Logging** via `logging.getLogger(__name__)`
- **Type annotations** throughout, including generics on the Environment base class
- **torch.Tensor** for token storage (inherited from the OpenEnv pattern)
- **dataclass-style Pydantic** with Field() for all model fields

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `EpisodeContext` (dataclass or Pydantic) | Per-episode server state: db_connection, question, step_count, budget, described_tables, action_log | Already designed in models.py comments; isolates episode state from the environment singleton |
| `_execute_sql(sql, timeout)` | Sandboxed SQL execution: SELECT-only check, read-only connection, timeout, truncation | Single responsibility; reused by both QUERY and internal gold-answer computation |
| `_handle_describe(table_name)` | Return the schema for a specific table | Already exists as `_get_table_schema()`; just needs to use the agent-provided table name directly instead of Ollama |
| `_handle_sample(table_name)` | Execute `SELECT * FROM table LIMIT N` via `_execute_sql()` | Already exists as `_generate_sample_query()`; needs to actually execute the SQL |
| `_handle_query(sql_string)` | Validate and execute agent-provided SQL | New; wraps `_execute_sql()` with SELECT-only validation |
| `_handle_answer(value)` | Compare to the gold answer, set done=True, compute terminal reward | New; minimal for MVP (delegate to verifier.py later) |
| `_load_questions(path)` | Load and parse the question JSON | Simple loader; called once at init or lazily |
| `_open_db(db_path)` | Open a read-only SQLite connection with a timeout | Called at reset(); isolated for testability |
| `_build_observation()` | Construct SQLObservation from the episode context | Replaces the current `_create_observation()`, which only handles messages/tokens |

### Abstraction Level

- **Current level:** Mostly flat -- a single class with private helper methods. No service layer, no repository pattern. The conceptual `EpisodeContext` is outlined but not implemented.
- **Recommendation:** Stay flat. Add `EpisodeContext` as a dataclass to hold per-episode state (replacing the current `self._state`, which mixes episode data with token history). Keep all action handlers as private methods on `SQLEnvironment`. Do not introduce a separate service class or handler classes.
### Anti-Patterns to Avoid

- **Do not create separate handler classes** (e.g., DescribeHandler, QueryHandler) -- this codebase uses private methods on the environment class
- **Do not over-abstract the SQL execution** -- a single `_execute_sql()` method is sufficient; no need for a query builder or execution strategy pattern
- **Do not keep the `message_to_action()` + `_detect_action_type()` pipeline for the new structured flow** -- these are artifacts of the Ollama-based NL interpretation. The agent now sends structured actions directly. However, if `message_to_action()` is part of the OpenEnv contract, it may need to be preserved as a thin adapter
- **Do not add reward computation in F001** -- `server/reward.py` is explicitly a Phase 3 stub. For F001, use a simple terminal reward (1.0 for a correct ANSWER, 0.0 otherwise)
- **Avoid a single 200-line step() method** -- dispatch to `_handle_describe()`, `_handle_sample()`, `_handle_query()`, `_handle_answer()` and keep each under 30 lines

---

## 5. Constraints

### Technical Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| Query execution latency | < 100ms | SQLite on local disk with small Spider databases; should be trivial |
| Query timeout | 5 seconds max | Use `sqlite3.Connection.set_progress_handler()` or execute in a thread with a timeout |
| Read-only access | No writes to the database | Open SQLite with a `?mode=ro` URI or use `PRAGMA query_only = ON` |
| SELECT-only queries | Block INSERT/UPDATE/DELETE/DROP/CREATE/ALTER | Parse the SQL prefix before execution |
| Output truncation | Max 20 rows | Truncate the result set, add a "... (N more rows)" indicator |
| Step budget | 15 steps per episode (configurable) | Decrement on non-ANSWER actions |

### Pattern Constraints

- Must implement `reset(seed, episode_id, **kwargs) -> SQLObservation` matching the OpenEnv base signature
- Must implement `step(action, timeout_s, **kwargs) -> SQLObservation` matching the OpenEnv base signature
- Must maintain a `state` property returning `SQLState`
- Pydantic models must remain serializable over HTTP/WebSocket (no raw sqlite3 objects in observations)
- Constructor must remain compatible with the `create_sql_environment()` factory in `app.py`

### Testing Constraints

| Test Suite | Coverage Area | Notes |
|------------|---------------|-------|
| `tests/test_smoke.py` | Models, environment reset/step, action detection, message_to_action, client serialization, schema introspection | 6 test classes, ~20 tests. Several will break: step tests assume Ollama-based behavior; reset tests assume no question loading. `TestActionDetection` and `TestMessageToAction` test NL keyword detection, which may be deprecated |

---

## 6. Open Questions

| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Where do the SQLite database files come from? | No .sqlite files exist. ORM models define the schema, but Spider databases are separate artifacts. `download_spider_data.py` only downloads question JSON, not databases | Developer decision: generate from ORM + seed, or download Spider DBs |
| Should `action_description` carry the structured argument, or should we add a dedicated `argument` field to SQLAction? | The v1 spec defines `argument: str` as a separate field. Current code uses `action_description` for NL text. Reusing `action_description` avoids a model change but is semantically misleading | Developer (API design decision) |
| What happens to `message_to_action()` and `_detect_action_type()`? | These convert NL messages to actions using keyword matching. With structured actions, the agent sends action_type directly. But `message_to_action` might be part of the OpenEnv contract | Developer + OpenEnv docs |
| Should SQLObservation fields be uncommented (question, schema_info, result, error, etc.), or should we continue using messages only? | The v1 spec and the commented-out fields describe a rich observation. The current implementation uses only messages + tokens. Rich fields make the API cleaner for RL agents | Developer (observation design decision) |

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `server/sql_environment.py` | Code | Main environment: 546 lines, Ollama-dependent step(), no SQL execution |
| `models.py` | Code | Wire types + conceptual EpisodeContext design (commented) |
| `server/app.py` | Code | FastAPI factory, tokenizer setup |
| `data/databases/models.py` | Code | 9 SQLAlchemy ORM tables (student_assessment schema) |
| `data/questions/student_assessment.json` | Data | 53 Spider questions with gold SQL, db_id, tokenized queries |
| `docs_draft/SQLEnv_Concept_v1.md` | Spec | V1 design: action space, episode lifecycle, reward architecture, anti-gaming |
| `server/verifier.py` | Code | Stub -- placeholder for Phase 2 answer verification |
| `server/reward.py` | Code | Stub -- placeholder for Phase 3 reward computation |
| `scripts/download_spider_data.py` | Code | Downloads question JSON from the HuggingFace Spider dataset (not databases) |
| `tests/test_smoke.py` | Code | 6 test classes, ~20 tests covering models, env, action detection, client |
| `server/test_sql_env.py` | Code | MockTokenizer for testing without transformers |
| `client.py` | Code | SQLEnvClient with _step_payload() and _parse_result() |
| OpenEnv `interfaces.py` | Code (vendored) | Environment base class: reset(seed, episode_id), step(action, timeout_s), state property |
| OpenEnv `types.py` | Code (vendored) | Action, Observation, State Pydantic base models |

---

## Human Validation Checkpoint
**Before proceeding to planning, please confirm:**

- [ ] System context is accurate
- [ ] Dependencies are complete
- [ ] Risks are identified
- [ ] Constraints are correct
- [ ] Open questions can be resolved

**Questions for reviewer:**

1. Is anything incorrect or missing?
2. Are there risks I haven't identified?
3. Should we proceed to planning?

---

*Validated by: [NAME] on [DATE]*