
Research Summary

Project: SQLEnv - Question Dataset Expansion
Change: F004 - Expand from 53 questions (one DB) to 100+ questions across 5-10 Spider databases with difficulty labels, answer_type metadata, gold_answer fields, and a train/eval split
Date: 2026-03-24
Status: Draft


1. Change Overview

What We're Changing

Expanding the question dataset from the current 53 Spider questions for a single database (student_assessment) to 100+ curated questions spanning 5-10 Spider databases. Each question will be enriched with metadata fields: difficulty (easy/medium/hard at 40/40/20 split), answer_type (integer/float/string/list/table), gold_answer (pre-computed), and tables_involved. The dataset will be split into train (70%) and eval (30%) partitions.
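
For concreteness, a single enriched record might look like the sketch below. The field names follow the QuestionRecord conceptual design referenced in models.py; the concrete values (including the gold answer) are illustrative, not taken from the actual dataset:

```python
# Illustrative enriched question record. Field names follow the
# QuestionRecord design in models.py; values are hypothetical examples.
example_question = {
    "question_id": "concert_singer_train_007",
    "database_name": "concert_singer",
    "question_text": "How many singers are there?",
    "gold_sql": "SELECT count(*) FROM singer",
    "gold_answer": 6,                    # pre-computed by executing gold_sql
    "answer_type": "integer",            # integer / float / string / list / table
    "difficulty": "easy",                # easy (1-2 tables), medium (2-3), hard (4+)
    "tables_involved": ["singer"],
    "split": "train",                    # train (70%) or eval (30%)
}
```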

Why We're Changing It

Training on a single database schema risks overfitting. The agent needs diverse schemas, question patterns, and difficulty levels to develop generalizable SQL exploration strategies. Pre-computed gold answers avoid re-executing gold SQL every RL episode, improving training throughput.

Success Criteria

  • 100+ questions across 5-10 Spider databases
  • Difficulty distribution: ~40% easy (1-2 tables), ~40% medium (2-3 tables), ~20% hard (4+ tables)
  • Every question has: gold_answer, answer_type, difficulty, tables_involved
  • Train/eval split at 70/30 with no cross-contamination
  • No questions requiring SQL features unsupported by SQLite
  • Diverse answer types (integer, float, string, list) and SQL patterns (aggregation, joins, subqueries, grouping)

2. System Context

Current Behavior

The system stores 53 raw Spider questions in data/questions/student_assessment.json. These questions use the Spider dataset's native format with fields: db_id, query (gold SQL), question (natural language), query_toks, query_toks_no_value, and question_toks. There are no difficulty, answer_type, gold_answer, or tables_involved fields. There is no train/eval split.

The SQLEnvironment class in server/sql_environment.py currently hardcodes the student_assessment schema: it imports all 9 ORM models by name and builds a static schema description string in _build_schema_description(). Questions are not yet loaded or used in the environment loop (the conceptual QuestionRecord is defined in comments in models.py but not implemented).

Architecture Context

data/
  questions/
    student_assessment.json    <-- Current: raw Spider format, 53 questions
    (new files per database)   <-- Target: enriched format, 100+ questions
  databases/
    models.py                  <-- Current: student_assessment ORM only
    (new models per database)  <-- Target: ORM per database OR direct SQLite

scripts/
  download_spider_data.py      <-- Downloads questions from HuggingFace
  generate_models_from_schema.py  <-- Auto-generates ORM from Spider schema

server/
  sql_environment.py           <-- Hardcoded to student_assessment schema
  verifier.py                  <-- Stub; will use answer_type for comparison

models.py                      <-- QuestionRecord conceptual design (comments)

F004 is a data-layer feature that produces the enriched question files. It does not implement the environment's question-loading logic (that belongs to F001 Core Environment Loop), but it must produce data in the format that F001/F002/F003 will consume.

Entry Points

| Entry Point | Trigger | Current Flow |
|---|---|---|
| data/questions/student_assessment.json | Read at import/reset | Raw Spider format; 53 entries with db_id, query, question, tokenized variants |
| scripts/download_spider_data.py | Manual CLI invocation | Downloads Spider questions from the xlangai/spider HuggingFace dataset, filters by db_id, saves raw JSON |
| scripts/generate_models_from_schema.py | Manual CLI invocation | Downloads schema from richardr1126/spider-schema, generates SQLAlchemy model .py files |
| data/databases/models.py | Imported by sql_environment.py | Hand-written SQLAlchemy ORM for student_assessment (9 tables) |

Data Flow

| Data | Source | Shape/Type | Destination |
|---|---|---|---|
| Raw Spider questions | HuggingFace xlangai/spider | [{db_id, query, question, query_toks, query_toks_no_value, question_toks}] | data/questions/{db_id}.json |
| Spider schema | HuggingFace richardr1126/spider-schema | [{db_id, table: [{name, columns: [{name, type}]}], foreign_keys}] | data/models/{db_id}.py (generated ORM) |
| Enriched questions (target) | Curation script (new) | [{id, db_id, question, gold_sql, gold_answer, answer_type, difficulty, tables_involved, split}] | data/questions/ or a single manifest file |
| SQLite database files | Spider dataset (to be downloaded) | .sqlite files | data/databases/{db_id}/{db_id}.sqlite |

3. Dependencies

Code We Depend On

| Dependency | What We Use | Risk if Changed |
|---|---|---|
| datasets (HuggingFace) | load_dataset("xlangai/spider") for questions, load_dataset("richardr1126/spider-schema") for schemas | Dataset API changes could break download scripts |
| Spider dataset (xlangai/spider) | Raw questions with gold SQL | Dataset structure is stable (academic benchmark) |
| Spider schema dataset (richardr1126/spider-schema) | Table/column definitions for ORM generation | Third-party dataset; less stable than official Spider |
| sqlite3 (stdlib) | Execute gold SQL to compute gold_answer | Stable (stdlib) |
| SQLAlchemy | ORM model definitions used by the environment | Already a project dependency |

Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|---|---|---|
| F001 (Core Environment Loop) | Loads questions from JSON at reset(), selects a question, opens the SQLite database | Must produce questions in a format matching the QuestionRecord conceptual design in models.py |
| F002 (Answer Verification) | Uses answer_type and gold_answer to verify agent submissions | Must correctly classify answer types and pre-compute gold answers |
| F003 (Dense Reward) | Uses gold_answer for progress-to-target comparison (Layer 2) | Gold answers must be deterministic and correct |
| F006 (GRPO Training) | Uses the train split for training, the eval split for evaluation | Train/eval split must be clean |
| server/sql_environment.py | Currently hardcodes student_assessment ORM imports and schema description | Multi-database support will require environment changes (F001 scope), but F004 must provide the data |

External Systems

| System | Integration Point | Considerations |
|---|---|---|
| HuggingFace Hub | datasets.load_dataset() | Network required for initial download; cache locally |
| Spider SQLite databases | Direct file access | No .sqlite files exist in the repo yet; must be downloaded or created |

4. Risks & Edge Cases

Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Gold SQL produces different results across SQLite versions | Low | Incorrect gold_answer, bad reward signal | Pin the SQLite version; validate gold answers on the target SQLite |
| Some Spider questions use SQL features not in SQLite (e.g., ILIKE, DATEDIFF) | Medium | Questions fail to execute | Filter questions by executing gold SQL against actual SQLite; exclude failures |
| Spider database .sqlite files not available via the HuggingFace datasets API | Medium | Cannot execute gold SQL to compute gold_answer | Download .sqlite files from the Spider GitHub repo or reconstruct from schema |
| Ambiguous gold answers (queries returning non-deterministic order) | Medium | Reward gives false negatives | For list/table answer types, use order-insensitive comparison; flag and review ORDER BY-dependent queries |
| Difficulty classification is subjective | Low | Uneven difficulty distribution | Use a heuristic: count distinct tables in the gold SQL to assign difficulty |
| Train/eval data leakage (same question rephrased across Spider train/validation) | Low | Overfitting on the eval set | Use Spider's own train/validation split as the basis; additionally deduplicate by gold SQL |
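
The table-counting difficulty heuristic mentioned above could be sketched roughly as follows. This is an assumption about the eventual implementation, not existing project code; the regex-based table extraction is deliberately crude, and the overlapping 2-table boundary (easy is 1-2 tables, medium 2-3) is resolved here by treating a 2-table query with a JOIN as medium:

```python
import re


def count_tables(gold_sql: str) -> int:
    """Rough count of distinct table names referenced after FROM/JOIN."""
    names = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)",
                       gold_sql, flags=re.IGNORECASE)
    return len({n.lower() for n in names})


def classify_difficulty(gold_sql: str) -> str:
    """Map the table count onto the 40/40/20 easy/medium/hard buckets."""
    n = count_tables(gold_sql)
    if n >= 4:
        return "hard"
    if n >= 3 or (n == 2 and "join" in gold_sql.lower()):
        return "medium"
    return "easy"
```

Subqueries and aliased self-joins slip past this regex, so the heuristic's output would still need spot-checking against the final distribution.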

Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|---|---|---|
| Gold SQL returns an empty result set | N/A | Include as a valid question; gold_answer = empty list/table; answer_type = "list" or "table" |
| Gold SQL returns NULL values | N/A | Normalize NULLs to Python None; handle in answer_type classification |
| Multiple valid gold SQLs for the same question | Only one gold SQL per Spider question | Accept Spider's single gold SQL; note that alternative SQL may produce the same answer |
| Database has no questions in Spider | N/A | Skip the database during curation |
| Question text contains typos/ambiguity (known Spider issue) | N/A | Accept as-is for MVP; flag obvious issues |
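
The empty-result and NULL edge cases above suggest an answer_type classifier along these lines. This is a hypothetical sketch (classify_answer is not an existing project function); note that a 0-row multi-column result falls into the list bucket here, which would need review during curation:

```python
def classify_answer(rows: list[tuple]) -> tuple[str, object]:
    """Classify a fetched result set into (answer_type, gold_answer).

    Hypothetical sketch: sqlite3 already surfaces NULL as Python None,
    so NULLs need no extra conversion; empty results fall through to
    the list/table types, per the edge-case rules above.
    """
    if len(rows) == 1 and len(rows[0]) == 1:
        value = rows[0][0]                 # single scalar cell
        if isinstance(value, int):
            return "integer", value
        if isinstance(value, float):
            return "float", value
        return "string", value             # str or None (NULL) both land here
    if all(len(r) == 1 for r in rows):
        return "list", [r[0] for r in rows]   # single column (possibly empty)
    return "table", [list(r) for r in rows]
```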

Invariants to Preserve

  • Every question in the dataset has all required fields populated (no partial records)
  • Every gold_sql executes successfully against its corresponding SQLite database
  • Every gold_answer matches the result of executing gold_sql
  • Train and eval splits have no overlapping question IDs
  • Difficulty distribution approximates 40/40/20 (easy/medium/hard)
  • No question requires SQL features unsupported by SQLite
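
These invariants can be checked mechanically. A minimal sketch, assuming the questions_train.json/questions_eval.json layout proposed in the Target Shape section and the data/databases/{db_id}/{db_id}.sqlite path convention (validate_dataset is a hypothetical helper):

```python
import json
import sqlite3
from pathlib import Path


def validate_dataset(train_path: Path, eval_path: Path, db_root: Path) -> None:
    """Assert the dataset invariants listed above; raises on violation."""
    required = {"question_id", "question_text", "database_name", "gold_sql",
                "gold_answer", "answer_type", "difficulty", "tables_involved"}
    train = json.loads(train_path.read_text())
    eval_ = json.loads(eval_path.read_text())

    # Train and eval splits must have no overlapping question IDs.
    train_ids = {q["question_id"] for q in train}
    eval_ids = {q["question_id"] for q in eval_}
    assert not (train_ids & eval_ids), "train/eval splits overlap"

    for q in train + eval_:
        # Every record must be fully populated (no partial records).
        assert required <= q.keys(), f"missing fields in {q.get('question_id')}"
        # Every gold_sql must execute against its SQLite database.
        db = db_root / q["database_name"] / f"{q['database_name']}.sqlite"
        with sqlite3.connect(db) as conn:
            conn.execute(q["gold_sql"]).fetchall()
```

Checking gold_answer against the re-executed result and the 40/40/20 distribution would extend this in the same style.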

4b. Code Shape & Design Target

Existing Vocabulary

| Concept | Existing Name | Location |
|---|---|---|
| Question metadata (conceptual) | QuestionRecord | models.py lines 224-235 (commented design) |
| Database identifier | db_id | Spider format field; used throughout download_spider_data.py |
| Gold SQL | query (Spider) / gold_sql (QuestionRecord) | student_assessment.json / models.py |
| Answer types | integer, float, string, list, table | models.py line 233, server/verifier.py docstring |
| Difficulty levels | easy, medium, hard | models.py line 234 |
| ORM models dictionary | self.db_models | server/sql_environment.py line 127 |
| Spider download function | download_spider_questions() | scripts/download_spider_data.py |
| Model generation function | generate_simplified_models() | scripts/generate_models_from_schema.py |

Language/Framework Idioms

  • Python scripts with argparse CLI in scripts/ directory
  • JSON files for data storage (not YAML, not CSV)
  • SQLAlchemy declarative ORM for database schema (though direct SQLite may suffice for F004)
  • Pydantic models for typed data contracts (models.py)
  • HuggingFace datasets library for Spider data access
  • Type hints throughout; pathlib.Path for file operations
  • Docstrings in Google style with Args/Returns sections

Target Shape

| Component | Purpose | Why This Boundary |
|---|---|---|
| scripts/curate_questions.py | Main curation script: download questions for selected DBs, enrich with metadata, compute gold answers, assign difficulty, create splits, validate, output the final dataset | Single script matching the existing scripts/ pattern; orchestrates the full pipeline |
| data/questions/questions_train.json | Training split (70%) of enriched questions | Consumed by F001 at reset(); a separate file makes the split explicit |
| data/questions/questions_eval.json | Evaluation split (30%) of enriched questions | Consumed by the F005 Green Agent; prevents training on eval data |
| data/databases/{db_id}/{db_id}.sqlite | SQLite database files, one per Spider database | Required to execute gold SQL and compute gold_answer; also needed by F001 for live query execution |

Abstraction Level

  • Current level: Flat scripts in scripts/. Data as JSON files in data/. No abstraction layers for data loading.
  • Recommendation: Match existing flat style. One script that does everything end-to-end. Output is JSON files. No data-loading library or ORM for the curation pipeline itself -- just sqlite3 and json. The environment's question-loading code belongs to F001, not F004.
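
In that spirit, the core of the pipeline reduces to a couple of plain functions. A sketch under the stated sqlite3-and-json assumption (the function names and the raw-to-enriched field mapping are hypothetical; answer_type, difficulty, and split assignment would follow in the same script):

```python
import json
import sqlite3
from pathlib import Path


def compute_gold_answer(db_path: Path, gold_sql: str):
    """Execute a gold SQL query once and return its raw rows.

    Returns None when the SQL uses a feature SQLite does not support,
    so the caller can exclude the question (per the filtering rule).
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(gold_sql).fetchall()
    except sqlite3.OperationalError:
        return None
    finally:
        conn.close()


def enrich_questions(raw_path: Path, db_path: Path) -> list[dict]:
    """Enrich raw Spider records, dropping any whose gold SQL fails."""
    enriched = []
    for raw in json.loads(raw_path.read_text()):
        rows = compute_gold_answer(db_path, raw["query"])
        if rows is None:
            continue                      # unsupported SQL: filtered out
        enriched.append({
            "question_text": raw["question"],
            "database_name": raw["db_id"],
            "gold_sql": raw["query"],
            "gold_answer": rows,          # normalized by answer_type later
        })
    return enriched
```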

Anti-Patterns to Avoid

  • Do not create a complex data pipeline framework (e.g., classes like QuestionCurator, DifficultyClassifier, AnswerTypeDetector). A single script with clear functions is sufficient.
  • Do not generate SQLAlchemy ORM models per database for F004 purposes. The curation script only needs sqlite3 to execute gold SQL. Whether the environment needs ORM models per DB is an F001 decision.
  • Do not embed the curation logic inside the server code. Keep it as a standalone script that produces static JSON files.
  • Do not hardcode database selection. Use a configuration list (or CLI args) so databases can be added/removed easily.

5. Constraints

Technical Constraints

| Constraint | Requirement | Notes |
|---|---|---|
| SQLite compatibility | All gold SQL must execute on SQLite | Spider was designed for SQLite; ~99% compatible, but verify edge cases |
| Dataset size | 100+ questions minimum | Quality over quantity; the user specified 100 as sufficient for the MVP |
| Difficulty split | ~40% easy / ~40% medium / ~20% hard | Hard questions (4+ tables) are rarer in Spider; may need to pull from more databases |
| Answer types | Cover integer, float, string, list at minimum | The table type can be deferred per the F002 user interview |
| No network at runtime | Questions and SQLite files must be committed to the repo or downloaded once | The curation script runs offline after the initial download |

Pattern Constraints

  • Question ID format: Use {db_id}_{split}_{index} (e.g., concert_singer_train_007) to match the conceptual QuestionRecord.question_id format like spider_dev_042
  • Output JSON must match the QuestionRecord fields defined in models.py comments: question_id, question_text, database_name, gold_sql, gold_answer, answer_type, difficulty, tables_involved
  • Spider's own train/validation split should be respected as the basis (train questions -> train split, validation questions -> eval split)
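
Under those constraints, ID stamping and split assignment could look like this (a sketch; the per-database zero-padded counter is an assumption beyond the stated {db_id}_{split}_{index} format):

```python
def assign_question_ids(questions: list[dict], spider_split: str) -> None:
    """Stamp {db_id}_{split}_{index} IDs in place, respecting Spider's split.

    Spider "train" questions map to our train split; Spider "validation"
    questions map to our eval split, per the pattern constraint above.
    """
    split = "train" if spider_split == "train" else "eval"
    counters: dict[str, int] = {}          # per-database running index
    for q in questions:
        db_id = q["database_name"]
        idx = counters.get(db_id, 0)
        counters[db_id] = idx + 1
        q["question_id"] = f"{db_id}_{split}_{idx:03d}"
        q["split"] = split
```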

Testing Constraints

| Test Suite | Coverage Area | Notes |
|---|---|---|
| tests/test_smoke.py | Environment instantiation, action detection, serialization | Must still pass; F004 should not change server code |
| New: dataset validation tests | All questions valid, gold SQL executes, splits clean | Should be part of the curation script's --validate mode or a separate test |

6. Open Questions

| Question | Why It Matters | Who Can Answer |
|---|---|---|
| Which specific Spider databases to include? | Determines schema diversity and question count | Researcher (see analysis below) |
| Where to get the SQLite database files? | The Spider HuggingFace datasets may not include .sqlite files directly | Technical investigation |
| Should the output be per-database JSON files or a single manifest? | Affects how F001 loads questions | Architecture decision |

Spider Database Candidates (Research-Based Recommendations):

Good candidates for diverse schemas and well-formed questions based on Spider dataset characteristics:

  1. student_assessment (already have; 53 questions, 9 tables) -- education domain
  2. concert_singer (popular Spider DB; ~30 questions, 4 tables) -- entertainment domain
  3. world_1 (~30 questions, 3 tables) -- geography domain
  4. car_1 (~20 questions, 4 tables) -- automotive domain
  5. employee_hire_evaluation (~20 questions, 4 tables) -- HR domain
  6. pets_1 (~20 questions, 3 tables) -- simple schema, good for easy questions
  7. cre_Doc_Template_Mgt (~25 questions, 6 tables) -- document management domain
  8. dog_kennels (~25 questions, 7 tables) -- business domain
  9. flight_2 (~20 questions, 5 tables) -- transportation domain
  10. poker_player (~15 questions, 2 tables) -- simple, good for easy questions

These span diverse domains, table counts (2-9), and would yield ~250+ raw questions to curate down to 100+ high-quality ones.

SQLite Database Files:

The Spider dataset's SQLite files are typically obtained from the official Spider GitHub release (https://github.com/taoyds/spider), not from the HuggingFace datasets API. The xlangai/spider HuggingFace dataset contains questions but likely not the .sqlite files themselves. The curation script will need to either:

  1. Download .sqlite files from the Spider GitHub release
  2. Reconstruct databases from the schema dataset using CREATE TABLE + INSERT statements

Option 1 is more reliable. The Spider GitHub release includes a database/ directory with all SQLite files.


7. Context Sources

| Source | Type | Notes |
|---|---|---|
| data/questions/student_assessment.json | Code/Data | Current format: raw Spider with db_id, query, question, tokenized variants. Missing: difficulty, answer_type, gold_answer, tables_involved |
| scripts/download_spider_data.py | Code | Downloads from the xlangai/spider HuggingFace dataset. Supports --db-id filter and --split (train/validation) |
| scripts/generate_models_from_schema.py | Code | Downloads from richardr1126/spider-schema. Generates SQLAlchemy ORM files. Uses generate_simplified_models() |
| data/databases/models.py | Code | Hand-written SQLAlchemy ORM for student_assessment. 9 tables with relationships. This is the reference quality for ORM models |
| models.py | Code | QuestionRecord conceptual design (lines 224-235): defines target fields question_id, question_text, database_name, gold_sql, gold_answer, answer_type, difficulty, tables_involved |
| server/sql_environment.py | Code | Hardcoded to student_assessment: imports 9 specific ORM models, builds a static schema string. _build_schema_description() must match the ORM |
| server/verifier.py | Code | Stub with a docstring defining 5 answer types: integer, float, string, list, table |
| server/reward.py | Code | Stub referencing the 3-layer reward. Layer 2 needs gold_answer for progress comparison |
| docs/ARCHITECTURE.md | Doc | System map showing the current single-DB architecture. Notes SQLite files not yet present |
| specs/FEATURES.json | Doc | F004 definition with user interview context |
| docs_draft/sql_env_project_brief.md | Doc | Project brief: 50-100 questions target, multi-hop insight, difficulty progression |

Human Validation Checkpoint

Before proceeding to planning, please confirm:

  • System context is accurate
  • Dependencies are complete
  • Risks are identified
  • Constraints are correct
  • Open questions can be resolved

Questions for reviewer:

  1. Is anything incorrect or missing?
  2. Are there risks I haven't identified?
  3. Should we proceed to planning?

Validated by: [NAME] on [DATE]