
Research Summary

Project: SQLEnv - Question Dataset Expansion
Change: F004 - Expand from 53 questions (one DB) to 100+ questions across 5-10 Spider databases with difficulty labels, answer_type metadata, gold_answer fields, and a train/eval split
Date: 2026-03-24
Status: Draft


1. Change Overview

What We're Changing

Expanding the question dataset from the current 53 Spider questions for a single database (student_assessment) to 100+ curated questions spanning 5-10 Spider databases. Each question will be enriched with metadata fields: difficulty (easy/medium/hard at 40/40/20 split), answer_type (integer/float/string/list/table), gold_answer (pre-computed), and tables_involved. The dataset will be split into train (70%) and eval (30%) partitions.
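
For concreteness, a single enriched record might look like the sketch below. The field names follow the QuestionRecord conceptual design referenced in models.py; the concrete values (including the gold answer) are illustrative, not taken from the actual dataset:

```python
# Illustrative enriched question record. Field names follow the
# QuestionRecord design in models.py; values are hypothetical examples.
example_question = {
    "question_id": "concert_singer_train_007",
    "database_name": "concert_singer",
    "question_text": "How many singers are there?",
    "gold_sql": "SELECT count(*) FROM singer",
    "gold_answer": 6,                    # pre-computed by executing gold_sql
    "answer_type": "integer",            # integer / float / string / list / table
    "difficulty": "easy",                # easy (1-2 tables), medium (2-3), hard (4+)
    "tables_involved": ["singer"],
    "split": "train",                    # train (70%) or eval (30%)
}
```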

Why We're Changing It

Training on a single database schema risks overfitting. The agent needs diverse schemas, question patterns, and difficulty levels to develop generalizable SQL exploration strategies. Pre-computed gold answers avoid re-executing gold SQL every RL episode, improving training throughput.

Success Criteria

  • 100+ questions across 5-10 Spider databases
  • Difficulty distribution: ~40% easy (1-2 tables), ~40% medium (2-3 tables), ~20% hard (4+ tables)
  • Every question has: gold_answer, answer_type, difficulty, tables_involved
  • Train/eval split at 70/30 with no cross-contamination
  • No questions requiring SQL features unsupported by SQLite
  • Diverse answer types (integer, float, string, list) and SQL patterns (aggregation, joins, subqueries, grouping)

2. System Context

Current Behavior

The system stores 53 raw Spider questions in data/questions/student_assessment.json. These questions use the Spider dataset's native format with fields: db_id, query (gold SQL), question (natural language), query_toks, query_toks_no_value, and question_toks. There are no difficulty, answer_type, gold_answer, or tables_involved fields. There is no train/eval split.

The SQLEnvironment class in server/sql_environment.py currently hardcodes the student_assessment schema: it imports all 9 ORM models by name and builds a static schema description string in _build_schema_description(). Questions are not yet loaded or used in the environment loop (the conceptual QuestionRecord is defined in comments in models.py but not implemented).

Architecture Context

data/
  questions/
    student_assessment.json    <-- Current: raw Spider format, 53 questions
    (new files per database)   <-- Target: enriched format, 100+ questions
  databases/
    models.py                  <-- Current: student_assessment ORM only
    (new models per database)  <-- Target: ORM per database OR direct SQLite

scripts/
  download_spider_data.py      <-- Downloads questions from HuggingFace
  generate_models_from_schema.py  <-- Auto-generates ORM from Spider schema

server/
  sql_environment.py           <-- Hardcoded to student_assessment schema
  verifier.py                  <-- Stub; will use answer_type for comparison

models.py                      <-- QuestionRecord conceptual design (comments)

F004 is a data-layer feature that produces the enriched question files. It does not implement the environment's question-loading logic (that belongs to F001 Core Environment Loop), but it must produce data in the format that F001/F002/F003 will consume.

Entry Points

| Entry Point | Trigger | Current Flow |
|---|---|---|
| data/questions/student_assessment.json | Read at import/reset | Raw Spider format; 53 entries with db_id, query, question, tokenized variants |
| scripts/download_spider_data.py | Manual CLI invocation | Downloads Spider questions from the xlangai/spider HuggingFace dataset, filters by db_id, saves raw JSON |
| scripts/generate_models_from_schema.py | Manual CLI invocation | Downloads schema from richardr1126/spider-schema, generates SQLAlchemy model .py files |
| data/databases/models.py | Imported by sql_environment.py | Hand-written SQLAlchemy ORM for student_assessment (9 tables) |

Data Flow

| Data | Source | Shape/Type | Destination |
|---|---|---|---|
| Raw Spider questions | HuggingFace xlangai/spider | [{db_id, query, question, query_toks, query_toks_no_value, question_toks}] | data/questions/{db_id}.json |
| Spider schema | HuggingFace richardr1126/spider-schema | [{db_id, table: [{name, columns: [{name, type}]}], foreign_keys}] | data/models/{db_id}.py (generated ORM) |
| Enriched questions (target) | Curation script (new) | [{id, db_id, question, gold_sql, gold_answer, answer_type, difficulty, tables_involved, split}] | data/questions/ or a single manifest file |
| SQLite database files | Spider dataset (to be downloaded) | .sqlite files | data/databases/{db_id}/{db_id}.sqlite |

3. Dependencies

Code We Depend On

| Dependency | What We Use | Risk if Changed |
|---|---|---|
| datasets (HuggingFace) | load_dataset("xlangai/spider") for questions, load_dataset("richardr1126/spider-schema") for schemas | Dataset API changes could break download scripts |
| Spider dataset (xlangai/spider) | Raw questions with gold SQL | Dataset structure is stable (academic benchmark) |
| Spider schema dataset (richardr1126/spider-schema) | Table/column definitions for ORM generation | Third-party dataset; less stable than official Spider |
| sqlite3 (stdlib) | Execute gold SQL to compute gold_answer | Stable (stdlib) |
| SQLAlchemy | ORM model definitions used by the environment | Already a project dependency |

Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|---|---|---|
| F001 (Core Environment Loop) | Loads questions from JSON at reset(), selects a question, opens the SQLite database | Must produce questions in a format matching the QuestionRecord conceptual design in models.py |
| F002 (Answer Verification) | Uses answer_type and gold_answer to verify agent submissions | Must correctly classify answer types and pre-compute gold answers |
| F003 (Dense Reward) | Uses gold_answer for progress-to-target comparison (Layer 2) | Gold answers must be deterministic and correct |
| F006 (GRPO Training) | Uses the train split for training, the eval split for evaluation | Train/eval split must be clean |
| server/sql_environment.py | Currently hardcodes student_assessment ORM imports and schema description | Multi-database support will require environment changes (F001 scope), but F004 must provide the data |

External Systems

| System | Integration Point | Considerations |
|---|---|---|
| HuggingFace Hub | datasets.load_dataset() | Network required for initial download; cache locally |
| Spider SQLite databases | Direct file access | No .sqlite files exist in the repo yet; must be downloaded or created |

4. Risks & Edge Cases

Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gold SQL produces different results across SQLite versions | Low | Incorrect gold_answer, bad reward signal | Pin the SQLite version; validate gold answers on the target SQLite |
| Some Spider questions use SQL features not in SQLite (e.g., ILIKE, DATEDIFF) | Medium | Questions fail to execute | Filter questions by executing gold SQL against actual SQLite; exclude failures |
| Spider database .sqlite files not available via the HuggingFace datasets API | Medium | Cannot execute gold SQL to compute gold_answer | Download .sqlite files from the Spider GitHub repo or reconstruct from schema |
| Ambiguous gold answers (queries returning non-deterministic order) | Medium | Reward gives false negatives | For list/table answer types, use order-insensitive comparison; flag and review ORDER BY-dependent queries |
| Difficulty classification is subjective | Low | Uneven difficulty distribution | Use a heuristic: count distinct tables in the gold SQL to assign difficulty |
| Train/eval data leakage (same question rephrased across Spider train/validation) | Low | Overfitting on the eval set | Use Spider's own train/validation split as the basis; additionally deduplicate by gold SQL |
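
The table-counting difficulty heuristic mentioned above could be sketched roughly as follows. This is an assumption about the eventual implementation, not existing project code; the regex-based table extraction is deliberately crude, and the overlapping 2-table boundary (easy is 1-2 tables, medium 2-3) is resolved here by treating a 2-table query with a JOIN as medium:

```python
import re


def count_tables(gold_sql: str) -> int:
    """Rough count of distinct table names referenced after FROM/JOIN."""
    names = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)",
                       gold_sql, flags=re.IGNORECASE)
    return len({n.lower() for n in names})


def classify_difficulty(gold_sql: str) -> str:
    """Map the table count onto the 40/40/20 easy/medium/hard buckets."""
    n = count_tables(gold_sql)
    if n >= 4:
        return "hard"
    if n >= 3 or (n == 2 and "join" in gold_sql.lower()):
        return "medium"
    return "easy"
```

Subqueries and aliased self-joins slip past this regex, so the heuristic's output would still need spot-checking against the final distribution.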

Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|---|---|---|
| Gold SQL returns an empty result set | N/A | Include as a valid question; gold_answer = empty list/table; answer_type = "list" or "table" |
| Gold SQL returns NULL values | N/A | Normalize NULLs to Python None; handle in answer_type classification |
| Multiple valid gold SQLs for the same question | Only one gold SQL per Spider question | Accept Spider's single gold SQL; note that alternative SQL may produce the same answer |
| Database has no questions in Spider | N/A | Skip the database during curation |
| Question text contains typos/ambiguity (known Spider issue) | N/A | Accept as-is for MVP; flag obvious issues |
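
The empty-result and NULL edge cases above suggest an answer_type classifier along these lines. This is a hypothetical sketch (classify_answer is not an existing project function); note that a 0-row multi-column result falls into the list bucket here, which would need review during curation:

```python
def classify_answer(rows: list[tuple]) -> tuple[str, object]:
    """Classify a fetched result set into (answer_type, gold_answer).

    Hypothetical sketch: sqlite3 already surfaces NULL as Python None,
    so NULLs need no extra conversion; empty results fall through to
    the list/table types, per the edge-case rules above.
    """
    if len(rows) == 1 and len(rows[0]) == 1:
        value = rows[0][0]                 # single scalar cell
        if isinstance(value, int):
            return "integer", value
        if isinstance(value, float):
            return "float", value
        return "string", value             # str or None (NULL) both land here
    if all(len(r) == 1 for r in rows):
        return "list", [r[0] for r in rows]   # single column (possibly empty)
    return "table", [list(r) for r in rows]
```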

Invariants to Preserve

  • Every question in the dataset has all required fields populated (no partial records)
  • Every gold_sql executes successfully against its corresponding SQLite database
  • Every gold_answer matches the result of executing gold_sql
  • Train and eval splits have no overlapping question IDs
  • Difficulty distribution approximates 40/40/20 (easy/medium/hard)
  • No question requires SQL features unsupported by SQLite
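
These invariants can be checked mechanically. A minimal sketch, assuming the questions_train.json/questions_eval.json layout proposed in the Target Shape section and the data/databases/{db_id}/{db_id}.sqlite path convention (validate_dataset is a hypothetical helper):

```python
import json
import sqlite3
from pathlib import Path


def validate_dataset(train_path: Path, eval_path: Path, db_root: Path) -> None:
    """Assert the dataset invariants listed above; raises on violation."""
    required = {"question_id", "question_text", "database_name", "gold_sql",
                "gold_answer", "answer_type", "difficulty", "tables_involved"}
    train = json.loads(train_path.read_text())
    eval_ = json.loads(eval_path.read_text())

    # Train and eval splits must have no overlapping question IDs.
    train_ids = {q["question_id"] for q in train}
    eval_ids = {q["question_id"] for q in eval_}
    assert not (train_ids & eval_ids), "train/eval splits overlap"

    for q in train + eval_:
        # Every record must be fully populated (no partial records).
        assert required <= q.keys(), f"missing fields in {q.get('question_id')}"
        # Every gold_sql must execute against its SQLite database.
        db = db_root / q["database_name"] / f"{q['database_name']}.sqlite"
        with sqlite3.connect(db) as conn:
            conn.execute(q["gold_sql"]).fetchall()
```

Checking gold_answer against the re-executed result and the 40/40/20 distribution would extend this in the same style.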

4b. Code Shape & Design Target

Existing Vocabulary

| Concept | Existing Name | Location |
|---|---|---|
| Question metadata (conceptual) | QuestionRecord | models.py lines 224-235 (commented design) |
| Database identifier | db_id | Spider format field; used throughout download_spider_data.py |
| Gold SQL | query (Spider) / gold_sql (QuestionRecord) | student_assessment.json / models.py |
| Answer types | integer, float, string, list, table | models.py line 233, server/verifier.py docstring |
| Difficulty levels | easy, medium, hard | models.py line 234 |
| ORM models dictionary | self.db_models | server/sql_environment.py line 127 |
| Spider download function | download_spider_questions() | scripts/download_spider_data.py |
| Model generation function | generate_simplified_models() | scripts/generate_models_from_schema.py |

Language/Framework Idioms

  • Python scripts with argparse CLI in scripts/ directory
  • JSON files for data storage (not YAML, not CSV)
  • SQLAlchemy declarative ORM for database schema (though direct SQLite may suffice for F004)
  • Pydantic models for typed data contracts (models.py)
  • HuggingFace datasets library for Spider data access
  • Type hints throughout; pathlib.Path for file operations
  • Docstrings in Google style with Args/Returns sections

Target Shape

| Component | Purpose | Why This Boundary |
|---|---|---|
| scripts/curate_questions.py | Main curation script: download questions for selected DBs, enrich with metadata, compute gold answers, assign difficulty, create splits, validate, output the final dataset | Single script matching the existing scripts/ pattern; orchestrates the full pipeline |
| data/questions/questions_train.json | Training split (70%) of enriched questions | Consumed by F001 at reset(); a separate file makes the split explicit |
| data/questions/questions_eval.json | Evaluation split (30%) of enriched questions | Consumed by the F005 Green Agent; prevents training on eval data |
| data/databases/{db_id}/{db_id}.sqlite | SQLite database files, one per Spider database | Required to execute gold SQL and compute gold_answer; also needed by F001 for live query execution |

Abstraction Level

  • Current level: Flat scripts in scripts/. Data as JSON files in data/. No abstraction layers for data loading.
  • Recommendation: Match existing flat style. One script that does everything end-to-end. Output is JSON files. No data-loading library or ORM for the curation pipeline itself -- just sqlite3 and json. The environment's question-loading code belongs to F001, not F004.
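
In that spirit, the core of the pipeline reduces to a couple of plain functions. A sketch under the stated sqlite3-and-json assumption (the function names and the raw-to-enriched field mapping are hypothetical; answer_type, difficulty, and split assignment would follow in the same script):

```python
import json
import sqlite3
from pathlib import Path


def compute_gold_answer(db_path: Path, gold_sql: str):
    """Execute a gold SQL query once and return its raw rows.

    Returns None when the SQL uses a feature SQLite does not support,
    so the caller can exclude the question (per the filtering rule).
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(gold_sql).fetchall()
    except sqlite3.OperationalError:
        return None
    finally:
        conn.close()


def enrich_questions(raw_path: Path, db_path: Path) -> list[dict]:
    """Enrich raw Spider records, dropping any whose gold SQL fails."""
    enriched = []
    for raw in json.loads(raw_path.read_text()):
        rows = compute_gold_answer(db_path, raw["query"])
        if rows is None:
            continue                      # unsupported SQL: filtered out
        enriched.append({
            "question_text": raw["question"],
            "database_name": raw["db_id"],
            "gold_sql": raw["query"],
            "gold_answer": rows,          # normalized by answer_type later
        })
    return enriched
```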

Anti-Patterns to Avoid

  • Do not create a complex data pipeline framework (e.g., classes like QuestionCurator, DifficultyClassifier, AnswerTypeDetector). A single script with clear functions is sufficient.
  • Do not generate SQLAlchemy ORM models per database for F004 purposes. The curation script only needs sqlite3 to execute gold SQL. Whether the environment needs ORM models per DB is an F001 decision.
  • Do not embed the curation logic inside the server code. Keep it as a standalone script that produces static JSON files.
  • Do not hardcode database selection. Use a configuration list (or CLI args) so databases can be added/removed easily.

5. Constraints

Technical Constraints

| Constraint | Requirement | Notes |
|---|---|---|
| SQLite compatibility | All gold SQL must execute on SQLite | Spider was designed for SQLite; ~99% compatible, but verify edge cases |
| Dataset size | 100+ questions minimum | Quality over quantity; the user specified 100 as sufficient for the MVP |
| Difficulty split | ~40% easy / ~40% medium / ~20% hard | Hard questions (4+ tables) are rarer in Spider; may need to pull from more databases |
| Answer types | Cover integer, float, string, list at minimum | The table type can be deferred per the F002 user interview |
| No network at runtime | Questions and SQLite files must be committed to the repo or downloaded once | The curation script runs offline after the initial download |

Pattern Constraints

  • Question ID format: Use {db_id}_{split}_{index} (e.g., concert_singer_train_007) to match the conceptual QuestionRecord.question_id format like spider_dev_042
  • Output JSON must match the QuestionRecord fields defined in models.py comments: question_id, question_text, database_name, gold_sql, gold_answer, answer_type, difficulty, tables_involved
  • Spider's own train/validation split should be respected as the basis (train questions -> train split, validation questions -> eval split)
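
Under those constraints, ID stamping and split assignment could look like this (a sketch; the per-database zero-padded counter is an assumption beyond the stated {db_id}_{split}_{index} format):

```python
def assign_question_ids(questions: list[dict], spider_split: str) -> None:
    """Stamp {db_id}_{split}_{index} IDs in place, respecting Spider's split.

    Spider "train" questions map to our train split; Spider "validation"
    questions map to our eval split, per the pattern constraint above.
    """
    split = "train" if spider_split == "train" else "eval"
    counters: dict[str, int] = {}          # per-database running index
    for q in questions:
        db_id = q["database_name"]
        idx = counters.get(db_id, 0)
        counters[db_id] = idx + 1
        q["question_id"] = f"{db_id}_{split}_{idx:03d}"
        q["split"] = split
```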

Testing Constraints

| Test Suite | Coverage Area | Notes |
|---|---|---|
| tests/test_smoke.py | Environment instantiation, action detection, serialization | Must still pass; F004 should not change server code |
| New: dataset validation tests | All questions valid, gold SQL executes, splits clean | Should be part of the curation script's --validate mode or a separate test |

6. Open Questions

| Question | Why It Matters | Who Can Answer |
|---|---|---|
| Which specific Spider databases to include? | Determines schema diversity and question count | Researcher (see analysis below) |
| Where to get the SQLite database files? | The Spider HuggingFace datasets may not include .sqlite files directly | Technical investigation |
| Should the output be per-database JSON files or a single manifest? | Affects how F001 loads questions | Architecture decision |

Spider Database Candidates (Research-Based Recommendations):

Good candidates for diverse schemas and well-formed questions based on Spider dataset characteristics:

  1. student_assessment (already have; 53 questions, 9 tables) -- education domain
  2. concert_singer (popular Spider DB; ~30 questions, 4 tables) -- entertainment domain
  3. world_1 (~30 questions, 3 tables) -- geography domain
  4. car_1 (~20 questions, 4 tables) -- automotive domain
  5. employee_hire_evaluation (~20 questions, 4 tables) -- HR domain
  6. pets_1 (~20 questions, 3 tables) -- simple schema, good for easy questions
  7. cre_Doc_Template_Mgt (~25 questions, 6 tables) -- document management domain
  8. dog_kennels (~25 questions, 7 tables) -- business domain
  9. flight_2 (~20 questions, 5 tables) -- transportation domain
  10. poker_player (~15 questions, 2 tables) -- simple, good for easy questions

These span diverse domains, table counts (2-9), and would yield ~250+ raw questions to curate down to 100+ high-quality ones.

SQLite Database Files:

The Spider dataset's SQLite files are typically obtained from the official Spider GitHub release (https://github.com/taoyds/spider), not from the HuggingFace datasets API. The xlangai/spider HuggingFace dataset contains questions but likely not the .sqlite files themselves. The curation script will need to either:

  1. Download .sqlite files from the Spider GitHub release
  2. Reconstruct databases from the schema dataset using CREATE TABLE + INSERT statements

Option 1 is more reliable. The Spider GitHub release includes a database/ directory with all SQLite files.


7. Context Sources

| Source | Type | Notes |
|---|---|---|
| data/questions/student_assessment.json | Code/Data | Current format: raw Spider with db_id, query, question, tokenized variants. Missing: difficulty, answer_type, gold_answer, tables_involved |
| scripts/download_spider_data.py | Code | Downloads from the xlangai/spider HuggingFace dataset. Supports --db-id filter and --split (train/validation) |
| scripts/generate_models_from_schema.py | Code | Downloads from richardr1126/spider-schema. Generates SQLAlchemy ORM files. Uses generate_simplified_models() |
| data/databases/models.py | Code | Hand-written SQLAlchemy ORM for student_assessment. 9 tables with relationships. This is the reference quality for ORM models |
| models.py | Code | QuestionRecord conceptual design (lines 224-235): defines target fields question_id, question_text, database_name, gold_sql, gold_answer, answer_type, difficulty, tables_involved |
| server/sql_environment.py | Code | Hardcoded to student_assessment: imports 9 specific ORM models, builds a static schema string. _build_schema_description() must match the ORM |
| server/verifier.py | Code | Stub with a docstring defining 5 answer types: integer, float, string, list, table |
| server/reward.py | Code | Stub referencing the 3-layer reward. Layer 2 needs gold_answer for progress comparison |
| docs/ARCHITECTURE.md | Doc | System map showing the current single-DB architecture. Notes SQLite files not yet present |
| specs/FEATURES.json | Doc | F004 definition with user interview context |
| docs_draft/sql_env_project_brief.md | Doc | Project brief: 50-100 questions target, multi-hop insight, difficulty progression |

Human Validation Checkpoint

Before proceeding to planning, please confirm:

  • System context is accurate
  • Dependencies are complete
  • Risks are identified
  • Constraints are correct
  • Open questions can be resolved

Questions for reviewer:

  1. Is anything incorrect or missing?
  2. Are there risks I haven't identified?
  3. Should we proceed to planning?

Validated by: [NAME] on [DATE]