Spaces:

hjerpe
/

sql_env

Running

App Files Files Community

sql_env / specs /F001-VERIFICATION_SPEC.md

hjerpe

Upload folder using huggingface_hub

5dd1bb4 verified 21 days ago

preview code

raw

history blame contribute delete

23.1 kB

Verification Specification

Feature: F001 Generated from: specs/F001-VERIFICATION_INPUT.json Generated: 2026-03-24

1. Unit Tests

1.1 SQLAction Type

Test	Description	Input	Expected	Category
test_sqlaction_valid_describe	Create action with DESCRIBE type	`SQLAction(action_type="DESCRIBE", argument="employees")`	Fields set correctly	happy
test_sqlaction_valid_sample	Create action with SAMPLE type	`SQLAction(action_type="SAMPLE", argument="orders")`	Fields set correctly	happy
test_sqlaction_valid_query	Create action with QUERY type	`SQLAction(action_type="QUERY", argument="SELECT 1")`	Fields set correctly	happy
test_sqlaction_valid_answer	Create action with ANSWER type	`SQLAction(action_type="ANSWER", argument="42")`	Fields set correctly	happy
test_sqlaction_empty_argument	Argument is empty string	`SQLAction(action_type="QUERY", argument="")`	Accepted at type level (validation at step)	edge
test_sqlaction_whitespace_argument	Argument is only whitespace	`SQLAction(action_type="QUERY", argument=" ")`	Accepted at type level	edge
test_sqlaction_serialization	Round-trip JSON serialization	Create, serialize, deserialize	Identical fields	happy

1.2 SQLObservation Type

Test	Description	Input	Expected	Category
test_obs_all_fields_present	All required fields populated	Full observation construction	All fields accessible with correct types	happy
test_obs_done_false_initial	Initial observation has done=False	After reset	`done == False`	happy
test_obs_reward_none_nonterminal	Non-terminal obs has reward=None	After non-ANSWER step	`reward is None`	happy
test_obs_reward_set_terminal	Terminal obs has numeric reward	After ANSWER step	`reward in {0.0, 1.0}`	happy
test_obs_step_count_type	step_count is int	Any observation	`isinstance(step_count, int)`	happy
test_obs_budget_remaining_type	budget_remaining is int	Any observation	`isinstance(budget_remaining, int)`	happy
test_obs_action_history_list	action_history is a list of strings	After several steps	`isinstance(action_history, list)` and all elements are `str`	happy
test_obs_error_empty_on_success	error is empty string on success	After successful action	`error == ""`	happy
test_obs_schema_info_nonempty	schema_info is non-empty after reset	After reset	`len(schema_info) > 0`	happy

1.3 QuestionRecord Type

Test	Description	Input	Expected	Category
test_qr_all_fields	All fields populated from valid JSON	Parsed question JSON	All 8 fields present and correct types	happy
test_qr_difficulty_values	difficulty in allowed set	Various records	`difficulty in {"easy", "medium", "hard"}`	happy
test_qr_answer_type_values	answer_type in allowed set	Various records	`answer_type in {"integer", "float", "string", "list"}`	happy
test_qr_tables_involved_nonempty	tables_involved has at least one entry	Valid question	`len(tables_involved) >= 1`	happy

1.4 EpisodeContext Type

Test	Description	Input	Expected	Category
test_ctx_initial_step_count	step_count starts at 0	After reset	`step_count == 0`	happy
test_ctx_initial_budget	budget starts at configured value (default 15)	After reset	`budget == 15`	happy
test_ctx_described_tables_empty	described_tables starts empty	After reset	`len(described_tables) == 0`	happy
test_ctx_action_log_empty	action_log starts empty	After reset	`len(action_log) == 0`	happy
test_ctx_done_false_initial	done starts False	After reset	`done == False`	happy
test_ctx_gold_answer_computed	gold_answer is set after reset	After reset	`gold_answer is not None`	happy

1.5 SQLEnvironment.init

Test	Description	Input	Expected	Category
test_init_valid	Init with valid paths loads questions	Valid questions_path, db_dir	`len(questions) > 0`	happy
test_init_missing_questions_file	questions_path does not exist	`"/nonexistent/q.json"`	`FileNotFoundError`	error
test_init_missing_db_dir	db_dir does not exist	`"/nonexistent/dbs/"`	`FileNotFoundError` or `ValueError`	error
test_init_invalid_json	questions file is not valid JSON	Path to file with `"{bad"`	`ValueError`	error
test_init_empty_questions	questions file is empty array `[]`	Path to file with `"[]"`	`ValueError` (no questions)	error
test_init_custom_budget	Custom step_budget is stored	`step_budget=10`	Environment uses 10-step budget	happy
test_init_default_budget	Default step_budget is 15	Omit step_budget	Budget is 15	happy

1.6 SQLEnvironment.reset

Test	Description	Input	Expected	Category
test_reset_returns_observation	Returns SQLObservation	`reset()`	`isinstance(result, SQLObservation)`	happy
test_reset_obs_has_question	Observation contains question text	`reset()`	`len(obs.question) > 0`	happy
test_reset_obs_has_table_names	schema_info contains table names	`reset()`	schema_info mentions at least one table	happy
test_reset_obs_no_columns	schema_info does NOT reveal columns initially	`reset()`	No column names in schema_info	happy
test_reset_obs_done_false	done is False	`reset()`	`obs.done == False`	happy
test_reset_obs_reward_none	reward is None	`reset()`	`obs.reward is None`	happy
test_reset_obs_step_count_zero	step_count is 0	`reset()`	`obs.step_count == 0`	happy
test_reset_obs_budget_full	budget_remaining equals configured budget	`reset()`	`obs.budget_remaining == 15`	happy
test_reset_obs_empty_history	action_history is empty	`reset()`	`obs.action_history == []`	happy
test_reset_obs_empty_error	error is empty string	`reset()`	`obs.error == ""`	happy
test_reset_seed_determinism	Same seed yields same question	`reset(seed=42)` twice	Same question both times	happy
test_reset_different_seeds	Different seeds can yield different questions	`reset(seed=1)` vs `reset(seed=999)`	Not necessarily same question (probabilistic)	happy
test_reset_no_seed_random	Without seed, question is randomly selected	`reset()` multiple times	At least one different question (probabilistic)	happy
test_reset_episode_id_set	Custom episode_id is reflected	`reset(episode_id="ep-123")`	Context has episode_id="ep-123"	happy
test_reset_missing_db_file	Question references nonexistent DB	Question with bad db_name	`FileNotFoundError`	error
test_reset_clears_previous_episode	Calling reset mid-episode starts fresh	reset, step, reset	Second reset has step_count=0, empty history	happy

1.7 SQLEnvironment.step

Test	Description	Input	Expected	Category
test_step_returns_observation	Returns SQLObservation	Any valid action	`isinstance(result, SQLObservation)`	happy
test_step_never_raises	step never raises exceptions	Various invalid inputs	Returns obs with error field, no exception	happy
test_step_increments_step_count	step_count increases by 1	DESCRIBE action	`obs.step_count == 1`	happy
test_step_decrements_budget	budget_remaining decreases by 1	DESCRIBE/SAMPLE/QUERY	`obs.budget_remaining == 14`	happy
test_step_answer_no_budget_decrement	ANSWER does not decrement budget	ANSWER action	budget unchanged from before	happy
test_step_appends_action_history	action_history grows each step	Two DESCRIBE actions	`len(obs.action_history) == 2`	happy
test_step_invalid_action_type	Unknown action_type returns error	`action_type="UNKNOWN"`	`obs.error` contains "Unknown action type" and lists valid types	error
test_step_empty_argument	Empty argument returns error	`argument=""`	`obs.error` contains "cannot be empty"	error
test_step_whitespace_argument	Whitespace-only argument returns error	`argument=" "`	`obs.error` contains "cannot be empty"	error

1.8 SQLEnvironment._execute_sql

Test	Description	Input	Expected	Category
test_exec_valid_select	Executes a valid SELECT	`"SELECT 1"`	Returns list with results	happy
test_exec_non_select_rejected	Non-SELECT is rejected	`"DROP TABLE x"`	`ValueError` raised	error
test_exec_insert_rejected	INSERT is rejected	`"INSERT INTO t VALUES(1)"`	`ValueError` raised	error
test_exec_update_rejected	UPDATE is rejected	`"UPDATE t SET x=1"`	`ValueError` raised	error
test_exec_delete_rejected	DELETE is rejected	`"DELETE FROM t"`	`ValueError` raised	error
test_exec_create_rejected	CREATE is rejected	`"CREATE TABLE t(x INT)"`	`ValueError` raised	error
test_exec_semicolon_multi	Multiple statements rejected	`"SELECT 1; DROP TABLE t"`	Error or only first executed safely	error
test_exec_syntax_error	Malformed SQL	`"SELCET * FORM t"`	`sqlite3.OperationalError`	error
test_exec_timeout	Query exceeding timeout is interrupted	Long-running query (e.g., cartesian join)	Error raised within ~5 seconds	error
test_exec_result_truncation	Results truncated to 20 rows	Query returning 100 rows	`len(result) <= 20`	edge
test_exec_empty_result	Query returning no rows	`"SELECT * FROM t WHERE 1=0"`	Empty list `[]`	edge
test_exec_read_only	Connection is truly read-only	Attempt write via raw SQL	Error raised	error
test_exec_case_insensitive_select	`select` (lowercase) is accepted	`"select 1"`	Returns results	edge
test_exec_select_with_leading_whitespace	Leading whitespace before SELECT	`" SELECT 1"`	Returns results	edge
test_exec_select_with_comment	SQL comment before SELECT	`"-- comment\nSELECT 1"`	Handled correctly (accepted or rejected consistently)	edge

1.9 SQLEnvironment._handle_describe

Test	Description	Input	Expected	Category
test_describe_valid_table	Known table returns schema info	Existing table name	Result contains column names, types, row count	happy
test_describe_unknown_table	Unknown table returns error with list	`"nonexistent_table"`	Error containing "not found" and listing available tables	error
test_describe_case_sensitivity	Table name matching case behavior	Mixed case table name	Consistent behavior (either matches or provides helpful error)	edge
test_describe_updates_described_set	Described table is tracked	DESCRIBE a table	Table appears in described_tables set	happy
test_describe_repeated_table	Describing same table twice works	DESCRIBE same table twice	No error, returns same info	edge

1.10 SQLEnvironment._handle_sample

Test	Description	Input	Expected	Category
test_sample_valid_table	Known table returns sample rows	Existing table name	Formatted rows returned	happy
test_sample_unknown_table	Unknown table returns error	`"nonexistent_table"`	Error with available table list	error
test_sample_default_limit	Default limit is 5 rows	Omit limit parameter	At most 5 rows returned	happy
test_sample_empty_table	Table with no rows	Empty table	Returns empty or "no rows" result	edge

1.11 SQLEnvironment._handle_query

Test	Description	Input	Expected	Category
test_query_valid_select	Valid SELECT returns formatted results	`"SELECT * FROM t LIMIT 3"`	Formatted result string	happy
test_query_non_select	Non-SELECT returns error	`"DROP TABLE t"`	Error about SELECT-only	error
test_query_syntax_error	Bad SQL returns error	`"SELCET *"`	Error with sqlite3 message	error
test_query_timeout	Slow query returns timeout error	Expensive query	Error mentioning "timed out" and "5.0 seconds"	error
test_query_truncation_indicator	>20 rows shows truncation notice	Query returning many rows	Result indicates truncation occurred	edge
test_query_exactly_20_rows	Exactly 20 rows shows no truncation	Query returning 20 rows	No truncation indicator	edge

1.12 SQLEnvironment._handle_answer

Test	Description	Input	Expected	Category
test_answer_correct	Correct answer yields reward 1.0	Gold answer value	`(True, 1.0)`	happy
test_answer_incorrect	Incorrect answer yields reward 0.0	Wrong value	`(False, 0.0)`	happy
test_answer_case_insensitive	Case-insensitive comparison	`"PARIS"` vs gold `"paris"`	`(True, 1.0)`	edge
test_answer_whitespace_stripped	Leading/trailing whitespace stripped	`" 42 "` vs gold `"42"`	`(True, 1.0)`	edge
test_answer_sets_done	Episode is marked done after answer	Any answer	`done == True`	happy
test_answer_empty_string	Empty answer handled	`""`	`(False, 0.0)`	edge

1.13 SQLEnvironment._load_questions

Test	Description	Input	Expected	Category
test_load_valid_json	Valid questions JSON parsed	Path to valid file	`list[QuestionRecord]` with correct count	happy
test_load_file_not_found	Missing file raises error	`"/nonexistent.json"`	`FileNotFoundError`	error
test_load_invalid_json	Malformed JSON raises error	Path to `"{bad"`	`ValueError`	error
test_load_missing_fields	Record missing required field	JSON with missing `gold_sql`	`ValueError` or `KeyError`	error

1.14 SQLEnvironment._open_db

Test	Description	Input	Expected	Category
test_open_valid_db	Opens existing database	Valid db_name	`sqlite3.Connection` returned	happy
test_open_missing_db	Missing database raises error	`"nonexistent_db"`	`FileNotFoundError`	error
test_open_read_only	Connection is read-only	Valid db_name	Write operations fail	happy

1.15 SQLEnvironment._build_observation

Test	Description	Input	Expected	Category
test_build_obs_reflects_context	Observation matches episode context	Various context states	Fields match context values	happy
test_build_obs_schema_progressive	schema_info includes described columns	After DESCRIBE action	schema_info contains column details for described tables	happy
test_build_obs_schema_initial	Initial schema_info has only table names	Before any DESCRIBE	Only table names, no columns	happy

Run: pytest tests/unit/test_sql_environment.py -v

2. Integration Tests

Flow: Full Episode Lifecycle (Reset-Explore-Answer)

Step	Action	Expected	Verification
1	`reset(seed=42)`	Observation with question, table names, step=0, budget=15	Assert all initial observation fields
2	`step(DESCRIBE, "employees")`	Column info in result, step=1, budget=14	Assert result contains column names, step incremented
3	`step(SAMPLE, "employees")`	Sample rows in result, step=2, budget=13	Assert formatted rows in result
4	`step(QUERY, "SELECT COUNT(*) FROM employees")`	Count result, step=3, budget=12	Assert numeric result in output
5	`step(ANSWER, "<correct_value>")`	done=True, reward=1.0	Assert terminal state with positive reward
6	Verify no more steps accepted	Post-episode step attempt handled gracefully	Assert error or episode-over signal

Flow: Budget Exhaustion

Step	Action	Expected	Verification
1	`reset()` with budget=3	Observation with budget=3	Assert budget_remaining==3
2	`step(DESCRIBE, "t1")`	budget=2	Assert budget_remaining==2
3	`step(DESCRIBE, "t2")`	budget=1	Assert budget_remaining==1
4	`step(DESCRIBE, "t3")`	budget=0, done=True, reward=0.0	Assert terminal state, zero reward

Flow: Error Recovery Within Episode

Step	Action	Expected	Verification
1	`reset()`	Normal initial observation	Assert done==False
2	`step(QUERY, "DROP TABLE x")`	Error in obs.error, done=False	Assert error message, episode continues
3	`step(QUERY, "SELCET * FORM t")`	Error in obs.error, done=False	Assert SQL error message, episode continues
4	`step(DESCRIBE, "nonexistent")`	Error in obs.error, done=False	Assert "not found" error, episode continues
5	`step(QUERY, "SELECT 1")`	Success result, no error	Assert error=="" and result has data

Flow: Progressive Schema Discovery

Step	Action	Expected	Verification
1	`reset()`	schema_info has table names only	Assert table names present, no column details
2	`step(DESCRIBE, "table_a")`	schema_info now includes table_a columns	Assert column names for table_a appear
3	`step(DESCRIBE, "table_b")`	schema_info includes both table_a and table_b columns	Assert both tables' columns present

Flow: Seed Determinism

Step	Action	Expected	Verification
1	`reset(seed=42)`	Observation A	Record question and db
2	`reset(seed=42)`	Observation B identical to A	Assert same question_text and schema_info

Run: pytest tests/integration/test_episode_lifecycle.py -v

3. API Tests

POST /reset

Test	Request	Status	Response	Category
test_reset_no_params	`{}`	200	SQLObservation with done=False, budget=15	happy
test_reset_with_seed	`{"seed": 42}`	200	Deterministic SQLObservation	happy
test_reset_with_episode_id	`{"episode_id": "ep-1"}`	200	SQLObservation (episode_id reflected in context)	happy
test_reset_null_seed	`{"seed": null}`	200	Random question selected	happy
test_reset_server_error_missing_db	(DB file removed)	500	Error response indicating database not found	error
test_reset_response_schema	`{}`	200	Response contains all SQLObservation fields	happy

POST /step

Test	Request	Status	Response	Category
test_step_describe	`{"action_type": "DESCRIBE", "argument": "employees"}`	200	SQLObservation with schema result	happy
test_step_sample	`{"action_type": "SAMPLE", "argument": "employees"}`	200	SQLObservation with sample rows	happy
test_step_query	`{"action_type": "QUERY", "argument": "SELECT 1"}`	200	SQLObservation with query result	happy
test_step_answer	`{"action_type": "ANSWER", "argument": "42"}`	200	SQLObservation with done=True	happy
test_step_missing_action_type	`{"argument": "x"}`	422	Validation error	error
test_step_missing_argument	`{"action_type": "QUERY"}`	422	Validation error	error
test_step_empty_body	`{}`	422	Validation error	error
test_step_invalid_action_type	`{"action_type": "HACK", "argument": "x"}`	200	SQLObservation with error field set	error
test_step_without_reset	Call step before reset	200 or 500	Graceful error (not a crash)	error

Run: pytest tests/api/test_endpoints.py -v

4. E2E Tests

Scenario: Agent Solves Easy Question

Setup: Environment initialized with Spider questions and databases. Select an easy question with a known answer. Actions:

POST /reset with seed that selects an easy question
POST /step with DESCRIBE on relevant table
POST /step with SAMPLE on relevant table
POST /step with QUERY using correct SQL
POST /step with ANSWER providing correct value Expected: reward=1.0, done=True, total steps <= 15

Scenario: Agent Exhausts Budget

Setup: Environment initialized with budget=3. Actions:

POST /reset
POST /step with DESCRIBE x3 Expected: After 3rd step: done=True, reward=0.0, budget_remaining=0

Scenario: Agent Submits Wrong Answer

Setup: Environment initialized normally. Actions:

POST /reset
POST /step with ANSWER providing deliberately wrong value Expected: done=True, reward=0.0

Scenario: Agent Recovers From Errors

Setup: Environment initialized normally. Actions:

POST /reset
POST /step with QUERY "DROP TABLE x" (rejected)
POST /step with DESCRIBE "nonexistent_table" (error)
POST /step with valid QUERY
POST /step with correct ANSWER Expected: Error steps counted against budget, valid steps succeed, episode completes with reward=1.0

Run: pytest tests/e2e/test_full_episodes.py -v

5. Edge Cases Checklist

Null/None seed and episode_id in reset
Empty string argument for all action types
Whitespace-only argument for all action types
Very long SQL query (10KB+)
SQL with unicode characters in string literals
SQL with special characters (backticks, double quotes, brackets)
Table name with special characters or spaces
ANSWER with numeric value as string vs integer
ANSWER with leading zeros ("042" vs "42")
ANSWER with trailing decimal ("42.0" vs "42")
Budget of 1 (single step before exhaustion)
Budget of 0 (should episode immediately end?)
Calling step after episode is done
Calling reset multiple times without stepping
Concurrent episodes (if supported)
Database with zero tables
Database with very large tables (performance)
Question whose gold_sql returns empty result
SELECT with subqueries, CTEs, UNION
SQL injection attempts via argument field ("; DROP TABLE--)
Very long table name in DESCRIBE/SAMPLE

6. Evidence Requirements

Category	Evidence Type	Example
Unit tests	pytest output	`45 passed`
Integration	pytest output	`5 passed`
API tests	pytest output or httpx test client	`9 passed`
E2E	pytest output with real Spider DB	`4 passed`
SQL sandbox	Demonstrate write rejection	`ValueError: Only SELECT queries are allowed`
Timeout	Demonstrate query interruption	`Query timed out after 5.0 seconds`
Budget	Demonstrate forced termination	`done=True, reward=0.0 at budget=0`
Answer comparison	Demonstrate case-insensitive match	`"PARIS" == "paris" -> reward=1.0`