sql_env / specs /F001-VERIFICATION_SPEC.md
hjerpe's picture
Upload folder using huggingface_hub
5dd1bb4 verified

Verification Specification

Feature: F001 Generated from: specs/F001-VERIFICATION_INPUT.json Generated: 2026-03-24


1. Unit Tests

1.1 SQLAction Type

Test Description Input Expected Category
test_sqlaction_valid_describe Create action with DESCRIBE type SQLAction(action_type="DESCRIBE", argument="employees") Fields set correctly happy
test_sqlaction_valid_sample Create action with SAMPLE type SQLAction(action_type="SAMPLE", argument="orders") Fields set correctly happy
test_sqlaction_valid_query Create action with QUERY type SQLAction(action_type="QUERY", argument="SELECT 1") Fields set correctly happy
test_sqlaction_valid_answer Create action with ANSWER type SQLAction(action_type="ANSWER", argument="42") Fields set correctly happy
test_sqlaction_empty_argument Argument is empty string SQLAction(action_type="QUERY", argument="") Accepted at type level (validation at step) edge
test_sqlaction_whitespace_argument Argument is only whitespace SQLAction(action_type="QUERY", argument=" ") Accepted at type level edge
test_sqlaction_serialization Round-trip JSON serialization Create, serialize, deserialize Identical fields happy

1.2 SQLObservation Type

Test Description Input Expected Category
test_obs_all_fields_present All required fields populated Full observation construction All fields accessible with correct types happy
test_obs_done_false_initial Initial observation has done=False After reset done == False happy
test_obs_reward_none_nonterminal Non-terminal obs has reward=None After non-ANSWER step reward is None happy
test_obs_reward_set_terminal Terminal obs has numeric reward After ANSWER step reward in {0.0, 1.0} happy
test_obs_step_count_type step_count is int Any observation isinstance(step_count, int) happy
test_obs_budget_remaining_type budget_remaining is int Any observation isinstance(budget_remaining, int) happy
test_obs_action_history_list action_history is a list of strings After several steps isinstance(action_history, list) and all elements are str happy
test_obs_error_empty_on_success error is empty string on success After successful action error == "" happy
test_obs_schema_info_nonempty schema_info is non-empty after reset After reset len(schema_info) > 0 happy

1.3 QuestionRecord Type

Test Description Input Expected Category
test_qr_all_fields All fields populated from valid JSON Parsed question JSON All 8 fields present and correct types happy
test_qr_difficulty_values difficulty in allowed set Various records difficulty in {"easy", "medium", "hard"} happy
test_qr_answer_type_values answer_type in allowed set Various records answer_type in {"integer", "float", "string", "list"} happy
test_qr_tables_involved_nonempty tables_involved has at least one entry Valid question len(tables_involved) >= 1 happy

1.4 EpisodeContext Type

Test Description Input Expected Category
test_ctx_initial_step_count step_count starts at 0 After reset step_count == 0 happy
test_ctx_initial_budget budget starts at configured value (default 15) After reset budget == 15 happy
test_ctx_described_tables_empty described_tables starts empty After reset len(described_tables) == 0 happy
test_ctx_action_log_empty action_log starts empty After reset len(action_log) == 0 happy
test_ctx_done_false_initial done starts False After reset done == False happy
test_ctx_gold_answer_computed gold_answer is set after reset After reset gold_answer is not None happy

1.5 SQLEnvironment.init

Test Description Input Expected Category
test_init_valid Init with valid paths loads questions Valid questions_path, db_dir len(questions) > 0 happy
test_init_missing_questions_file questions_path does not exist "/nonexistent/q.json" FileNotFoundError error
test_init_missing_db_dir db_dir does not exist "/nonexistent/dbs/" FileNotFoundError or ValueError error
test_init_invalid_json questions file is not valid JSON Path to file with "{bad" ValueError error
test_init_empty_questions questions file is empty array [] Path to file with "[]" ValueError (no questions) error
test_init_custom_budget Custom step_budget is stored step_budget=10 Environment uses 10-step budget happy
test_init_default_budget Default step_budget is 15 Omit step_budget Budget is 15 happy

1.6 SQLEnvironment.reset

Test Description Input Expected Category
test_reset_returns_observation Returns SQLObservation reset() isinstance(result, SQLObservation) happy
test_reset_obs_has_question Observation contains question text reset() len(obs.question) > 0 happy
test_reset_obs_has_table_names schema_info contains table names reset() schema_info mentions at least one table happy
test_reset_obs_no_columns schema_info does NOT reveal columns initially reset() No column names in schema_info happy
test_reset_obs_done_false done is False reset() obs.done == False happy
test_reset_obs_reward_none reward is None reset() obs.reward is None happy
test_reset_obs_step_count_zero step_count is 0 reset() obs.step_count == 0 happy
test_reset_obs_budget_full budget_remaining equals configured budget reset() obs.budget_remaining == 15 happy
test_reset_obs_empty_history action_history is empty reset() obs.action_history == [] happy
test_reset_obs_empty_error error is empty string reset() obs.error == "" happy
test_reset_seed_determinism Same seed yields same question reset(seed=42) twice Same question both times happy
test_reset_different_seeds Different seeds can yield different questions reset(seed=1) vs reset(seed=999) Not necessarily same question (probabilistic) happy
test_reset_no_seed_random Without seed, question is randomly selected reset() multiple times At least one different question (probabilistic) happy
test_reset_episode_id_set Custom episode_id is reflected reset(episode_id="ep-123") Context has episode_id="ep-123" happy
test_reset_missing_db_file Question references nonexistent DB Question with bad db_name FileNotFoundError error
test_reset_clears_previous_episode Calling reset mid-episode starts fresh reset, step, reset Second reset has step_count=0, empty history happy

1.7 SQLEnvironment.step

Test Description Input Expected Category
test_step_returns_observation Returns SQLObservation Any valid action isinstance(result, SQLObservation) happy
test_step_never_raises step never raises exceptions Various invalid inputs Returns obs with error field, no exception happy
test_step_increments_step_count step_count increases by 1 DESCRIBE action obs.step_count == 1 happy
test_step_decrements_budget budget_remaining decreases by 1 DESCRIBE/SAMPLE/QUERY obs.budget_remaining == 14 happy
test_step_answer_no_budget_decrement ANSWER does not decrement budget ANSWER action budget unchanged from before happy
test_step_appends_action_history action_history grows each step Two DESCRIBE actions len(obs.action_history) == 2 happy
test_step_invalid_action_type Unknown action_type returns error action_type="UNKNOWN" obs.error contains "Unknown action type" and lists valid types error
test_step_empty_argument Empty argument returns error argument="" obs.error contains "cannot be empty" error
test_step_whitespace_argument Whitespace-only argument returns error argument=" " obs.error contains "cannot be empty" error

1.8 SQLEnvironment._execute_sql

Test Description Input Expected Category
test_exec_valid_select Executes a valid SELECT "SELECT 1" Returns list with results happy
test_exec_non_select_rejected Non-SELECT is rejected "DROP TABLE x" ValueError raised error
test_exec_insert_rejected INSERT is rejected "INSERT INTO t VALUES(1)" ValueError raised error
test_exec_update_rejected UPDATE is rejected "UPDATE t SET x=1" ValueError raised error
test_exec_delete_rejected DELETE is rejected "DELETE FROM t" ValueError raised error
test_exec_create_rejected CREATE is rejected "CREATE TABLE t(x INT)" ValueError raised error
test_exec_semicolon_multi Multiple statements rejected "SELECT 1; DROP TABLE t" Error or only first executed safely error
test_exec_syntax_error Malformed SQL "SELCET * FORM t" sqlite3.OperationalError error
test_exec_timeout Query exceeding timeout is interrupted Long-running query (e.g., cartesian join) Error raised within ~5 seconds error
test_exec_result_truncation Results truncated to 20 rows Query returning 100 rows len(result) <= 20 edge
test_exec_empty_result Query returning no rows "SELECT * FROM t WHERE 1=0" Empty list [] edge
test_exec_read_only Connection is truly read-only Attempt write via raw SQL Error raised error
test_exec_case_insensitive_select select (lowercase) is accepted "select 1" Returns results edge
test_exec_select_with_leading_whitespace Leading whitespace before SELECT " SELECT 1" Returns results edge
test_exec_select_with_comment SQL comment before SELECT "-- comment\nSELECT 1" Handled correctly (accepted or rejected consistently) edge

1.9 SQLEnvironment._handle_describe

Test Description Input Expected Category
test_describe_valid_table Known table returns schema info Existing table name Result contains column names, types, row count happy
test_describe_unknown_table Unknown table returns error with list "nonexistent_table" Error containing "not found" and listing available tables error
test_describe_case_sensitivity Table name matching case behavior Mixed case table name Consistent behavior (either matches or provides helpful error) edge
test_describe_updates_described_set Described table is tracked DESCRIBE a table Table appears in described_tables set happy
test_describe_repeated_table Describing same table twice works DESCRIBE same table twice No error, returns same info edge

1.10 SQLEnvironment._handle_sample

Test Description Input Expected Category
test_sample_valid_table Known table returns sample rows Existing table name Formatted rows returned happy
test_sample_unknown_table Unknown table returns error "nonexistent_table" Error with available table list error
test_sample_default_limit Default limit is 5 rows Omit limit parameter At most 5 rows returned happy
test_sample_empty_table Table with no rows Empty table Returns empty or "no rows" result edge

1.11 SQLEnvironment._handle_query

Test Description Input Expected Category
test_query_valid_select Valid SELECT returns formatted results "SELECT * FROM t LIMIT 3" Formatted result string happy
test_query_non_select Non-SELECT returns error "DROP TABLE t" Error about SELECT-only error
test_query_syntax_error Bad SQL returns error "SELCET *" Error with sqlite3 message error
test_query_timeout Slow query returns timeout error Expensive query Error mentioning "timed out" and "5.0 seconds" error
test_query_truncation_indicator >20 rows shows truncation notice Query returning many rows Result indicates truncation occurred edge
test_query_exactly_20_rows Exactly 20 rows shows no truncation Query returning 20 rows No truncation indicator edge

1.12 SQLEnvironment._handle_answer

Test Description Input Expected Category
test_answer_correct Correct answer yields reward 1.0 Gold answer value (True, 1.0) happy
test_answer_incorrect Incorrect answer yields reward 0.0 Wrong value (False, 0.0) happy
test_answer_case_insensitive Case-insensitive comparison "PARIS" vs gold "paris" (True, 1.0) edge
test_answer_whitespace_stripped Leading/trailing whitespace stripped " 42 " vs gold "42" (True, 1.0) edge
test_answer_sets_done Episode is marked done after answer Any answer done == True happy
test_answer_empty_string Empty answer handled "" (False, 0.0) edge

1.13 SQLEnvironment._load_questions

Test Description Input Expected Category
test_load_valid_json Valid questions JSON parsed Path to valid file list[QuestionRecord] with correct count happy
test_load_file_not_found Missing file raises error "/nonexistent.json" FileNotFoundError error
test_load_invalid_json Malformed JSON raises error Path to "{bad" ValueError error
test_load_missing_fields Record missing required field JSON with missing gold_sql ValueError or KeyError error

1.14 SQLEnvironment._open_db

Test Description Input Expected Category
test_open_valid_db Opens existing database Valid db_name sqlite3.Connection returned happy
test_open_missing_db Missing database raises error "nonexistent_db" FileNotFoundError error
test_open_read_only Connection is read-only Valid db_name Write operations fail happy

1.15 SQLEnvironment._build_observation

Test Description Input Expected Category
test_build_obs_reflects_context Observation matches episode context Various context states Fields match context values happy
test_build_obs_schema_progressive schema_info includes described columns After DESCRIBE action schema_info contains column details for described tables happy
test_build_obs_schema_initial Initial schema_info has only table names Before any DESCRIBE Only table names, no columns happy

Run: pytest tests/unit/test_sql_environment.py -v


2. Integration Tests

Flow: Full Episode Lifecycle (Reset-Explore-Answer)

Step Action Expected Verification
1 reset(seed=42) Observation with question, table names, step=0, budget=15 Assert all initial observation fields
2 step(DESCRIBE, "employees") Column info in result, step=1, budget=14 Assert result contains column names, step incremented
3 step(SAMPLE, "employees") Sample rows in result, step=2, budget=13 Assert formatted rows in result
4 step(QUERY, "SELECT COUNT(*) FROM employees") Count result, step=3, budget=12 Assert numeric result in output
5 step(ANSWER, "<correct_value>") done=True, reward=1.0 Assert terminal state with positive reward
6 Verify no more steps accepted Post-episode step attempt handled gracefully Assert error or episode-over signal

Flow: Budget Exhaustion

Step Action Expected Verification
1 reset() with budget=3 Observation with budget=3 Assert budget_remaining==3
2 step(DESCRIBE, "t1") budget=2 Assert budget_remaining==2
3 step(DESCRIBE, "t2") budget=1 Assert budget_remaining==1
4 step(DESCRIBE, "t3") budget=0, done=True, reward=0.0 Assert terminal state, zero reward

Flow: Error Recovery Within Episode

Step Action Expected Verification
1 reset() Normal initial observation Assert done==False
2 step(QUERY, "DROP TABLE x") Error in obs.error, done=False Assert error message, episode continues
3 step(QUERY, "SELCET * FORM t") Error in obs.error, done=False Assert SQL error message, episode continues
4 step(DESCRIBE, "nonexistent") Error in obs.error, done=False Assert "not found" error, episode continues
5 step(QUERY, "SELECT 1") Success result, no error Assert error=="" and result has data

Flow: Progressive Schema Discovery

Step Action Expected Verification
1 reset() schema_info has table names only Assert table names present, no column details
2 step(DESCRIBE, "table_a") schema_info now includes table_a columns Assert column names for table_a appear
3 step(DESCRIBE, "table_b") schema_info includes both table_a and table_b columns Assert both tables' columns present

Flow: Seed Determinism

Step Action Expected Verification
1 reset(seed=42) Observation A Record question and db
2 reset(seed=42) Observation B identical to A Assert same question_text and schema_info

Run: pytest tests/integration/test_episode_lifecycle.py -v


3. API Tests

POST /reset

Test Request Status Response Category
test_reset_no_params {} 200 SQLObservation with done=False, budget=15 happy
test_reset_with_seed {"seed": 42} 200 Deterministic SQLObservation happy
test_reset_with_episode_id {"episode_id": "ep-1"} 200 SQLObservation (episode_id reflected in context) happy
test_reset_null_seed {"seed": null} 200 Random question selected happy
test_reset_server_error_missing_db (DB file removed) 500 Error response indicating database not found error
test_reset_response_schema {} 200 Response contains all SQLObservation fields happy

POST /step

Test Request Status Response Category
test_step_describe {"action_type": "DESCRIBE", "argument": "employees"} 200 SQLObservation with schema result happy
test_step_sample {"action_type": "SAMPLE", "argument": "employees"} 200 SQLObservation with sample rows happy
test_step_query {"action_type": "QUERY", "argument": "SELECT 1"} 200 SQLObservation with query result happy
test_step_answer {"action_type": "ANSWER", "argument": "42"} 200 SQLObservation with done=True happy
test_step_missing_action_type {"argument": "x"} 422 Validation error error
test_step_missing_argument {"action_type": "QUERY"} 422 Validation error error
test_step_empty_body {} 422 Validation error error
test_step_invalid_action_type {"action_type": "HACK", "argument": "x"} 200 SQLObservation with error field set error
test_step_without_reset Call step before reset 200 or 500 Graceful error (not a crash) error

Run: pytest tests/api/test_endpoints.py -v


4. E2E Tests

Scenario: Agent Solves Easy Question

Setup: Environment initialized with Spider questions and databases. Select an easy question with a known answer. Actions:

  1. POST /reset with seed that selects an easy question
  2. POST /step with DESCRIBE on relevant table
  3. POST /step with SAMPLE on relevant table
  4. POST /step with QUERY using correct SQL
  5. POST /step with ANSWER providing correct value Expected: reward=1.0, done=True, total steps <= 15

Scenario: Agent Exhausts Budget

Setup: Environment initialized with budget=3. Actions:

  1. POST /reset
  2. POST /step with DESCRIBE x3 Expected: After 3rd step: done=True, reward=0.0, budget_remaining=0

Scenario: Agent Submits Wrong Answer

Setup: Environment initialized normally. Actions:

  1. POST /reset
  2. POST /step with ANSWER providing deliberately wrong value Expected: done=True, reward=0.0

Scenario: Agent Recovers From Errors

Setup: Environment initialized normally. Actions:

  1. POST /reset
  2. POST /step with QUERY "DROP TABLE x" (rejected)
  3. POST /step with DESCRIBE "nonexistent_table" (error)
  4. POST /step with valid QUERY
  5. POST /step with correct ANSWER Expected: Error steps counted against budget, valid steps succeed, episode completes with reward=1.0

Run: pytest tests/e2e/test_full_episodes.py -v


5. Edge Cases Checklist

  • Null/None seed and episode_id in reset
  • Empty string argument for all action types
  • Whitespace-only argument for all action types
  • Very long SQL query (10KB+)
  • SQL with unicode characters in string literals
  • SQL with special characters (backticks, double quotes, brackets)
  • Table name with special characters or spaces
  • ANSWER with numeric value as string vs integer
  • ANSWER with leading zeros ("042" vs "42")
  • ANSWER with trailing decimal ("42.0" vs "42")
  • Budget of 1 (single step before exhaustion)
  • Budget of 0 (should episode immediately end?)
  • Calling step after episode is done
  • Calling reset multiple times without stepping
  • Concurrent episodes (if supported)
  • Database with zero tables
  • Database with very large tables (performance)
  • Question whose gold_sql returns empty result
  • SELECT with subqueries, CTEs, UNION
  • SQL injection attempts via argument field ("; DROP TABLE--)
  • Very long table name in DESCRIBE/SAMPLE

6. Evidence Requirements

Category Evidence Type Example
Unit tests pytest output 45 passed
Integration pytest output 5 passed
API tests pytest output or httpx test client 9 passed
E2E pytest output with real Spider DB 4 passed
SQL sandbox Demonstrate write rejection ValueError: Only SELECT queries are allowed
Timeout Demonstrate query interruption Query timed out after 5.0 seconds
Budget Demonstrate forced termination done=True, reward=0.0 at budget=0
Answer comparison Demonstrate case-insensitive match "PARIS" == "paris" -> reward=1.0