Research Summary
Project: SQLEnv - Question Dataset Expansion
Change: F004 - Expand from 53 questions (one DB) to 100+ questions across 5-10 Spider databases with difficulty labels, answer_type metadata, gold_answer fields, and a train/eval split
Date: 2026-03-24
Status: Draft
1. Change Overview
What We're Changing
Expanding the question dataset from the current 53 Spider questions for a single database (student_assessment) to 100+ curated questions spanning 5-10 Spider databases. Each question will be enriched with metadata fields: difficulty (easy/medium/hard at 40/40/20 split), answer_type (integer/float/string/list/table), gold_answer (pre-computed), and tables_involved. The dataset will be split into train (70%) and eval (30%) partitions.
Why We're Changing It
Training on a single database schema risks overfitting. The agent needs diverse schemas, question patterns, and difficulty levels to develop generalizable SQL exploration strategies. Pre-computed gold answers avoid re-executing gold SQL every RL episode, improving training throughput.
Success Criteria
- 100+ questions across 5-10 Spider databases
- Difficulty distribution: ~40% easy (1-2 tables), ~40% medium (2-3 tables), ~20% hard (4+ tables)
- Every question has gold_answer, answer_type, difficulty, and tables_involved
- Train/eval split at 70/30 with no cross-contamination
- No questions requiring SQL features unsupported by SQLite
- Diverse answer types (integer, float, string, list) and SQL patterns (aggregation, joins, subqueries, grouping)
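To make the success criteria concrete, a single enriched record might look like the following sketch. The field names follow the QuestionRecord conceptual design; the concert_singer values (question text, answer, ID) are illustrative, not taken from the actual dataset:

```python
import json

# Hypothetical enriched record; field names follow the QuestionRecord
# design in models.py, values are illustrative only.
record = {
    "question_id": "concert_singer_train_007",
    "database_name": "concert_singer",
    "question_text": "How many singers are there?",
    "gold_sql": "SELECT count(*) FROM singer",
    "gold_answer": 6,
    "answer_type": "integer",
    "difficulty": "easy",
    "tables_involved": ["singer"],
    "split": "train",
}

# The "no partial records" invariant: all required fields present.
required = {"question_id", "question_text", "database_name", "gold_sql",
            "gold_answer", "answer_type", "difficulty", "tables_involved"}
assert required <= record.keys()
print(json.dumps(record, indent=2))
```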
2. System Context
Current Behavior
The system stores 53 raw Spider questions in data/questions/student_assessment.json. These questions use the Spider dataset's native format with fields: db_id, query (gold SQL), question (natural language), query_toks, query_toks_no_value, and question_toks. There are no difficulty, answer_type, gold_answer, or tables_involved fields. There is no train/eval split.
The SQLEnvironment class in server/sql_environment.py currently hardcodes the student_assessment schema: it imports all 9 ORM models by name and builds a static schema description string in _build_schema_description(). Questions are not yet loaded or used in the environment loop (the conceptual QuestionRecord is defined in comments in models.py but not implemented).
Architecture Context
data/
  questions/
    student_assessment.json         <-- Current: raw Spider format, 53 questions
    (new files per database)        <-- Target: enriched format, 100+ questions
  databases/
    models.py                       <-- Current: student_assessment ORM only
    (new models per database)       <-- Target: ORM per database OR direct SQLite
scripts/
  download_spider_data.py           <-- Downloads questions from HuggingFace
  generate_models_from_schema.py    <-- Auto-generates ORM from Spider schema
server/
  sql_environment.py                <-- Hardcoded to student_assessment schema
  verifier.py                       <-- Stub; will use answer_type for comparison
  models.py                         <-- QuestionRecord conceptual design (comments)
F004 is a data-layer feature that produces the enriched question files. It does not implement the environment's question-loading logic (that belongs to F001 Core Environment Loop), but it must produce data in the format that F001/F002/F003 will consume.
Entry Points
| Entry Point | Trigger | Current Flow |
|---|---|---|
| data/questions/student_assessment.json | Read at import/reset | Raw Spider format; 53 entries with db_id, query, question, tokenized variants |
| scripts/download_spider_data.py | Manual CLI invocation | Downloads Spider questions from the xlangai/spider HuggingFace dataset, filters by db_id, saves raw JSON |
| scripts/generate_models_from_schema.py | Manual CLI invocation | Downloads schema from richardr1126/spider-schema, generates SQLAlchemy model .py files |
| data/databases/models.py | Imported by sql_environment.py | Hand-written SQLAlchemy ORM for student_assessment (9 tables) |
Data Flow
| Data | Source | Shape/Type | Destination |
|---|---|---|---|
| Raw Spider questions | HuggingFace xlangai/spider | [{db_id, query, question, query_toks, query_toks_no_value, question_toks}] | data/questions/{db_id}.json |
| Spider schema | HuggingFace richardr1126/spider-schema | [{db_id, table: [{name, columns: [{name, type}]}], foreign_keys}] | data/models/{db_id}.py (generated ORM) |
| Enriched questions (target) | Curation script (new) | [{id, db_id, question, gold_sql, gold_answer, answer_type, difficulty, tables_involved, split}] | data/questions/ or a single manifest file |
| SQLite database files | Spider dataset (to be downloaded) | .sqlite files | data/databases/{db_id}/{db_id}.sqlite |
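The gold_answer computation step in the data flow amounts to executing the gold SQL against the downloaded .sqlite file. A minimal sketch, assuming the database file already exists at the destination path above (compute_gold_answer is a hypothetical helper name):

```python
import sqlite3

def compute_gold_answer(db_path: str, gold_sql: str):
    """Execute gold SQL read-only and return the raw result rows.

    Hypothetical helper: the real curation script would additionally
    classify the answer_type and normalize the rows before writing JSON.
    """
    # mode=ro guards against a gold query accidentally mutating the DB
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(gold_sql).fetchall()
    finally:
        conn.close()
    return rows
```

Because this runs once at curation time, gold SQL never needs to be re-executed during RL episodes, which is the throughput win described in the change overview.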
3. Dependencies
Code We Depend On
| Dependency | What We Use | Risk if Changed |
|---|---|---|
| datasets (HuggingFace) | load_dataset("xlangai/spider") for questions, load_dataset("richardr1126/spider-schema") for schemas | Dataset API changes could break download scripts |
| Spider dataset (xlangai/spider) | Raw questions with gold SQL | Dataset structure is stable (academic benchmark) |
| Spider schema dataset (richardr1126/spider-schema) | Table/column definitions for ORM generation | Third-party dataset; less stable than official Spider |
| sqlite3 (stdlib) | Execute gold SQL to compute gold_answer | Stable (stdlib) |
| SQLAlchemy | ORM model definitions used by environment | Already a project dependency |
Code That Depends On Us
| Dependent | How They Use Us | Impact of Our Change |
|---|---|---|
| F001 (Core Environment Loop) | Loads questions from JSON at reset(), selects a question, opens the SQLite database | Must produce questions in the format matching the QuestionRecord conceptual design in models.py |
| F002 (Answer Verification) | Uses answer_type and gold_answer to verify agent submissions | Must correctly classify answer types and pre-compute gold answers |
| F003 (Dense Reward) | Uses gold_answer for progress-to-target comparison (Layer 2) | Gold answers must be deterministic and correct |
| F006 (GRPO Training) | Uses the train split for training and the eval split for evaluation | Train/eval split must be clean |
| server/sql_environment.py | Currently hardcodes student_assessment ORM imports and schema description | Multi-database support will require environment changes (F001 scope), but F004 must provide the data |
External Systems
| System | Integration Point | Considerations |
|---|---|---|
| HuggingFace Hub | datasets.load_dataset() | Network required for initial download; cache locally |
| Spider SQLite databases | Direct file access | No .sqlite files exist in the repo yet; they must be downloaded or created |
4. Risks & Edge Cases
Identified Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gold SQL produces different results across SQLite versions | Low | Incorrect gold_answer, bad reward signal | Pin the SQLite version; validate gold answers on the target SQLite |
| Some Spider questions use SQL features not in SQLite (e.g., ILIKE, DATEDIFF) | Medium | Questions fail to execute | Filter questions by executing gold SQL against the actual SQLite database; exclude failures |
| Spider database .sqlite files not available via the HuggingFace datasets API | Medium | Cannot execute gold SQL to compute gold_answer | Download .sqlite files from the Spider GitHub repo or reconstruct them from the schema |
| Ambiguous gold answers (queries returning rows in non-deterministic order) | Medium | Reward gives false negatives | For list/table answer types, use order-insensitive comparison; flag and review ORDER BY-dependent queries |
| Difficulty classification is subjective | Low | Uneven difficulty distribution | Use a heuristic: count distinct tables in the gold SQL to assign difficulty |
| Train/eval data leakage (same question rephrased across Spider train/validation) | Low | Overfitting on the eval set | Use Spider's own train/validation split as the basis; additionally deduplicate by gold SQL |
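The table-counting heuristic from the mitigation column can be sketched as follows. This is a deliberately naive version that scans FROM/JOIN clauses with a regex, so aliases and nested subqueries are only handled approximately; the real script could instead use Spider's pre-tokenized query_toks. Note the 1-2/2-3 boundary in the target distribution overlaps at two tables; this sketch assigns two-table queries to easy:

```python
import re

def assign_difficulty(gold_sql: str) -> str:
    """Heuristic difficulty from the number of distinct tables referenced.

    Naive sketch: a regex over FROM/JOIN clauses. Aliases, subqueries,
    and comma-joins are handled only approximately.
    """
    tables = {m.group(2).lower()
              for m in re.finditer(r"\b(from|join)\s+([A-Za-z_][A-Za-z0-9_]*)",
                                   gold_sql, re.IGNORECASE)}
    n = len(tables)
    if n <= 2:
        return "easy"    # 1-2 tables (two-table boundary assigned to easy here)
    if n == 3:
        return "medium"
    return "hard"        # 4+ tables
```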
Edge Cases to Handle
| Edge Case | Current Behavior | Required Behavior |
|---|---|---|
| Gold SQL returns empty result set | N/A | Include as valid question; gold_answer = empty list/table; answer_type = "list" or "table" |
| Gold SQL returns NULL values | N/A | Normalize NULLs to Python None; handle in answer_type classification |
| Multiple valid gold SQLs for the same question | Only one gold SQL per Spider question | Accept Spider's single gold SQL; note that alternative SQL may produce the same answer |
| Database has no questions in Spider | N/A | Skip database during curation |
| Question text contains typos/ambiguity (Spider known issue) | N/A | Accept as-is for MVP; flag obvious issues |
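The empty-result and NULL edge cases above both land in the answer-type classification step. A sketch under the 5-type scheme from server/verifier.py; the cut-offs used here (single cell becomes a scalar, single column becomes a list, anything else a table) are an assumption, not the confirmed F002 rules:

```python
def classify_answer(rows):
    """Classify gold-SQL result rows into an (answer_type, gold_answer) pair.

    Assumed cut-offs: 1x1 -> scalar type, Nx1 -> list, else table.
    sqlite3 already maps SQL NULL to Python None, per the edge-case table.
    """
    if not rows:
        return "list", []          # empty result set is still a valid question
    if len(rows) == 1 and len(rows[0]) == 1:
        value = rows[0][0]
        if isinstance(value, int):
            return "integer", value
        if isinstance(value, float):
            return "float", value
        # strings and NULL scalars; the NULL type choice is deferred to F002
        return "string", value
    if all(len(r) == 1 for r in rows):
        return "list", [r[0] for r in rows]
    return "table", [list(r) for r in rows]
```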
Invariants to Preserve
- Every question in the dataset has all required fields populated (no partial records)
- Every gold_sql executes successfully against its corresponding SQLite database
- Every gold_answer matches the result of executing gold_sql
- Train and eval splits have no overlapping question IDs
- Difficulty distribution approximates 40/40/20 (easy/medium/hard)
- No question requires SQL features unsupported by SQLite
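Most of these invariants are mechanically checkable at curation time. A sketch of the check (validate_splits is a hypothetical name; the full --validate mode would also re-execute every gold_sql and compare against the stored gold_answer):

```python
from collections import Counter

def validate_splits(train: list, eval_: list) -> Counter:
    """Assert the split invariants over two lists of enriched records.

    Sketch only: SQL execution checks are omitted here.
    """
    train_ids = {q["question_id"] for q in train}
    eval_ids = {q["question_id"] for q in eval_}
    assert not train_ids & eval_ids, "train/eval question IDs overlap"

    required = {"question_id", "question_text", "database_name", "gold_sql",
                "gold_answer", "answer_type", "difficulty", "tables_involved"}
    for q in train + eval_:
        missing = required - q.keys()
        assert not missing, f"{q.get('question_id')}: missing {missing}"

    # Return the difficulty histogram; the caller checks the ~40/40/20 shape.
    return Counter(q["difficulty"] for q in train + eval_)
```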
4b. Code Shape & Design Target
Existing Vocabulary
| Concept | Existing Name | Location |
|---|---|---|
| Question metadata (conceptual) | QuestionRecord | models.py lines 224-235 (commented design) |
| Database identifier | db_id | Spider format field; used throughout download_spider_data.py |
| Gold SQL | query (Spider) / gold_sql (QuestionRecord) | student_assessment.json / models.py |
| Answer types | integer, float, string, list, table | models.py line 233, server/verifier.py docstring |
| Difficulty levels | easy, medium, hard | models.py line 234 |
| ORM models dictionary | self.db_models | server/sql_environment.py line 127 |
| Spider download function | download_spider_questions() | scripts/download_spider_data.py |
| Model generation function | generate_simplified_models() | scripts/generate_models_from_schema.py |
Language/Framework Idioms
- Python scripts with argparse CLI in the scripts/ directory
- JSON files for data storage (not YAML, not CSV)
- SQLAlchemy declarative ORM for database schema (though direct SQLite may suffice for F004)
- Pydantic models for typed data contracts (models.py)
- HuggingFace datasets library for Spider data access
- Type hints throughout; pathlib.Path for file operations
- Docstrings in Google style with Args/Returns sections
Target Shape
| Component | Purpose | Why This Boundary |
|---|---|---|
| scripts/curate_questions.py | Main curation script: download questions for selected DBs, enrich with metadata, compute gold answers, assign difficulty, create splits, validate, output the final dataset | Single script matching the existing scripts/ pattern; orchestrates the full pipeline |
| data/questions/questions_train.json | Training split (70%) of enriched questions | Consumed by F001 at reset(); a separate file makes the split explicit |
| data/questions/questions_eval.json | Evaluation split (30%) of enriched questions | Consumed by F005 Green Agent; prevents training on eval data |
| data/databases/{db_id}/{db_id}.sqlite | SQLite database files per Spider database | Required to execute gold SQL and compute gold_answer; also needed by F001 for live query execution |
Abstraction Level
- Current level: flat scripts in scripts/; data as JSON files in data/; no abstraction layers for data loading.
- Recommendation: match the existing flat style. One script that does everything end-to-end, with JSON files as output. No data-loading library or ORM for the curation pipeline itself -- just sqlite3 and json. The environment's question-loading code belongs to F001, not F004.
Anti-Patterns to Avoid
- Do not create a complex data pipeline framework (e.g., classes like QuestionCurator, DifficultyClassifier, AnswerTypeDetector). A single script with clear functions is sufficient.
- Do not generate SQLAlchemy ORM models per database for F004 purposes. The curation script only needs sqlite3 to execute gold SQL. Whether the environment needs ORM models per DB is an F001 decision.
- Do not embed the curation logic inside the server code. Keep it as a standalone script that produces static JSON files.
- Do not hardcode database selection. Use a configuration list (or CLI args) so databases can be added or removed easily.
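Put together, the recommended shape is a single flat script with an argparse CLI and a configurable database list. A hypothetical skeleton for scripts/curate_questions.py (function and flag names are illustrative, not a confirmed interface):

```python
import argparse
from pathlib import Path

# Database selection is configurable per the anti-pattern note above;
# these defaults are illustrative, not a confirmed selection.
DEFAULT_DBS = ["student_assessment", "concert_singer", "world_1"]

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Curate enriched questions")
    parser.add_argument("--db-ids", nargs="+", default=DEFAULT_DBS)
    parser.add_argument("--out-dir", type=Path, default=Path("data/questions"))
    parser.add_argument("--validate", action="store_true",
                        help="re-execute gold SQL and check invariants")
    return parser

def main() -> None:
    args = build_parser().parse_args()
    # Pipeline: download -> enrich -> compute gold answers -> split -> write.
    # Each stage would be a plain function in this file, not a class.
    ...
```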
5. Constraints
Technical Constraints
| Constraint | Requirement | Notes |
|---|---|---|
| SQLite compatibility | All gold SQL must execute on SQLite | Spider was designed for SQLite; ~99% compatible, but verify edge cases |
| Dataset size | 100+ questions minimum | Quality over quantity; user specified 100 as sufficient for MVP |
| Difficulty split | ~40% easy / ~40% medium / ~20% hard | Hard questions (4+ tables) are rarer in Spider; may need to pull from more databases |
| Answer types | Cover integer, float, string, list at minimum | Table type can be deferred per F002 user interview |
| No network at runtime | Questions and SQLite files must be committed to repo or downloaded once | Curation script runs offline after initial download |
Pattern Constraints
- Question ID format: use {db_id}_{split}_{index} (e.g., concert_singer_train_007), consistent with the conceptual QuestionRecord.question_id format like spider_dev_042
- Output JSON must match the QuestionRecord fields defined in the models.py comments: question_id, question_text, database_name, gold_sql, gold_answer, answer_type, difficulty, tables_involved
- Spider's own train/validation split should be respected as the basis (train questions -> train split, validation questions -> eval split)
Testing Constraints
| Test Suite | Coverage Area | Notes |
|---|---|---|
| tests/test_smoke.py | Environment instantiation, action detection, serialization | Must still pass; F004 should not change server code |
| New: dataset validation tests | All questions valid, gold SQL executes, splits clean | Should be part of the curation script's --validate mode or a separate test |
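One check the new validation tests should include is the order-insensitive comparison called for in the risks table: for list/table answers whose gold SQL has no ORDER BY, results are compared as multisets. A sketch (the function name is hypothetical):

```python
from collections import Counter

def rows_equal_unordered(a, b) -> bool:
    """Multiset equality over result rows, ignoring row order.

    Intended for list/table answer types without ORDER BY; rows are
    converted to tuples so they are hashable, and duplicates count.
    """
    return Counter(map(tuple, a)) == Counter(map(tuple, b))
```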
6. Open Questions
| Question | Why It Matters | Who Can Answer |
|---|---|---|
| Which specific Spider databases to include? | Determines schema diversity and question count | Researcher (see analysis below) |
| Where to get the SQLite database files? | The Spider HuggingFace datasets may not include .sqlite files directly | Technical investigation |
| Should the output be per-database JSON files or a single manifest? | Affects how F001 loads questions | Architecture decision |
Spider Database Candidates (Research-Based Recommendations):
Good candidates for diverse schemas and well-formed questions based on Spider dataset characteristics:
- student_assessment (already have; 53 questions, 9 tables) -- education domain
- concert_singer (popular Spider DB; ~30 questions, 4 tables) -- entertainment domain
- world_1 (~30 questions, 3 tables) -- geography domain
- car_1 (~20 questions, 4 tables) -- automotive domain
- employee_hire_evaluation (~20 questions, 4 tables) -- HR domain
- pets_1 (~20 questions, 3 tables) -- simple schema, good for easy questions
- cre_Doc_Template_Mgt (~25 questions, 6 tables) -- document management domain
- dog_kennels (~25 questions, 7 tables) -- business domain
- flight_2 (~20 questions, 5 tables) -- transportation domain
- poker_player (~15 questions, 2 tables) -- simple, good for easy questions
These span diverse domains, table counts (2-9), and would yield ~250+ raw questions to curate down to 100+ high-quality ones.
SQLite Database Files:
The Spider dataset's SQLite files are typically obtained from the official Spider GitHub release (https://github.com/taoyds/spider), not from the HuggingFace datasets API. The xlangai/spider HuggingFace dataset contains questions but likely not the .sqlite files themselves. The curation script will need to either:
1. Download the .sqlite files from the Spider GitHub release
2. Reconstruct databases from the schema dataset using CREATE TABLE + INSERT statements
Option 1 is more reliable. The Spider GitHub release includes a database/ directory with all SQLite files.
7. Context Sources
| Source | Type | Notes |
|---|---|---|
| data/questions/student_assessment.json | Code/Data | Current format: raw Spider with db_id, query, question, tokenized variants. Missing: difficulty, answer_type, gold_answer, tables_involved |
| scripts/download_spider_data.py | Code | Downloads from the xlangai/spider HuggingFace dataset. Supports a --db-id filter and --split (train/validation) |
| scripts/generate_models_from_schema.py | Code | Downloads from richardr1126/spider-schema. Generates SQLAlchemy ORM files. Uses generate_simplified_models() |
| data/databases/models.py | Code | Hand-written SQLAlchemy ORM for student_assessment. 9 tables with relationships. This is the reference quality for ORM models |
| models.py | Code | QuestionRecord conceptual design (lines 224-235): defines target fields question_id, question_text, database_name, gold_sql, gold_answer, answer_type, difficulty, tables_involved |
| server/sql_environment.py | Code | Hardcoded to student_assessment: imports 9 specific ORM models, builds a static schema string. _build_schema_description() must match the ORM |
| server/verifier.py | Code | Stub with a docstring defining 5 answer types: integer, float, string, list, table |
| server/reward.py | Code | Stub referencing the 3-layer reward. Layer 2 needs gold_answer for progress comparison |
| docs/ARCHITECTURE.md | Doc | System map showing the current single-DB architecture. Notes that SQLite files are not yet present |
| specs/FEATURES.json | Doc | F004 definition with user interview context |
| docs_draft/sql_env_project_brief.md | Doc | Project brief: 50-100 question target, multi-hop insight, difficulty progression |
Human Validation Checkpoint
Before proceeding to planning, please confirm:
- System context is accurate
- Dependencies are complete
- Risks are identified
- Constraints are correct
- Open questions can be resolved
Questions for reviewer:
- Is anything incorrect or missing?
- Are there risks I haven't identified?
- Should we proceed to planning?
Validated by: [NAME] on [DATE]