
Clarification Questions: F004 - Question Dataset Expansion

Generated: 2026-03-24
Research Summary: specs/F004-RESEARCH_SUMMARY.md
Status: Skipped (defaults used)


Questions

Researcher: Include only genuine ambiguities that emerged from research and are NOT already answered by the user interview context. Each question MUST cite a specific research finding. Include all questions that survive the skip-if-covered and citation filters -- do not impose an arbitrary cap. The structured format (defaults + impact) keeps scan time low regardless of count.

Impact calibration (controls Auto-Proceed Gate): The "Impact if Wrong" value directly determines whether the checkpoint blocks fast-approve.

  • High: a wrong choice requires rearchitecting, data loss, or a security risk (blocks fast-approve)
  • Medium: contained rework of more than 1 hour (auto-proceeds with default)
  • Low: minor implementation detail, easily changed (auto-proceeds with default)

Heuristic: if the question is about HOW to implement, not WHAT to implement, it is almost always Low or Medium.

| # | Category | Question | Default Assumption | Impact if Wrong | Answer |
|---|----------|----------|--------------------|-----------------|--------|
| 1 | Scope | Research found that the current server/sql_environment.py hardcodes 9 specific ORM model imports and `_build_schema_description()` for `student_assessment` only. Should F004 also produce per-database SQLAlchemy ORM model files (via `generate_models_from_schema.py`), or only the enriched question JSON + SQLite files, leaving ORM generation to F001? | F004 produces enriched question JSON + SQLite files only. ORM model generation (if needed) is deferred to F001, since the environment may work directly with SQLite via the `sqlite3` module instead of the SQLAlchemy ORM for multi-database support. | Medium | |
| 2 | Constraints | Research found no .sqlite database files in the repo (docs/ARCHITECTURE.md confirms: "SQLite database files -- Phase 3 -- queries currently go through Ollama, not executed locally"). The Spider .sqlite files are typically ~50-200 MB total in the official GitHub release. Should these be committed to the repo, or downloaded on demand by the curation script and gitignored? | Download on demand via the curation script and gitignore the .sqlite files. Add a `scripts/download_spider_databases.py` or a `--download-dbs` flag to the curation script. Commit only the enriched question JSON files (small). | Medium | |
| 3 | Scope | Research found that the QuestionRecord design in models.py (line 228) uses the format `spider_dev_042` for `question_id`, but the current question data uses Spider's native format with no explicit ID field. Should the output format exactly match the QuestionRecord field names (`question_id`, `question_text`, `database_name`, `gold_sql`, `gold_answer`, `answer_type`, `difficulty`, `tables_involved`), or use a different schema? | Use exactly the QuestionRecord field names from models.py lines 228-235, plus an added `split` field ("train"/"eval"). Drop the Spider-native fields (`query_toks`, `query_toks_no_value`, `question_toks`), as they are not referenced anywhere in the server code. | Low | |
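The question 3 default can be sketched as a small conversion step. This is an illustrative assumption, not the actual F001/F004 implementation: the field names come from the QuestionRecord citation above, while the placeholder values for `gold_answer`, `answer_type`, `difficulty`, and `tables_involved` stand in for enrichment logic that would run downstream.

```python
# Sketch: map one Spider-native dev entry onto the QuestionRecord field
# names from models.py, dropping the unused token fields and adding the
# proposed "split" field. Enrichment values are left as placeholders.

def to_question_record(spider_row: dict, index: int, split: str = "eval") -> dict:
    """Convert a Spider-native record to the enriched question schema."""
    return {
        "question_id": f"spider_dev_{index:03d}",  # e.g. spider_dev_042
        "question_text": spider_row["question"],
        "database_name": spider_row["db_id"],
        "gold_sql": spider_row["query"],
        "gold_answer": None,      # filled later by executing gold_sql
        "answer_type": None,      # inferred downstream (assumption)
        "difficulty": None,       # e.g. a Spider hardness bucket (assumption)
        "tables_involved": [],    # parsed from gold_sql downstream
        "split": split,           # added field: "train" or "eval"
    }
    # Note: query_toks, query_toks_no_value, and question_toks are simply
    # not carried over, matching the default assumption.
```

Because the Spider-native fields are dropped by omission rather than deletion, the converter tolerates records with or without the token fields.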

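The question 1 default (work directly with SQLite rather than generating per-database ORM models) can be sketched with the standard-library `sqlite3` module. The function name and path handling here are assumptions for illustration, not code from the server.

```python
import sqlite3

# Sketch: execute a gold query directly against one per-database
# .sqlite file, with no SQLAlchemy ORM models involved. This is the
# multi-database access pattern the default assumption describes.

def execute_gold_sql(db_path: str, gold_sql: str) -> list[tuple]:
    """Run a read-only gold query against a Spider SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(gold_sql)
        return cur.fetchall()
    finally:
        conn.close()
```

Because the connection is keyed only by `db_path`, the same code serves every Spider database, which is the main advantage over hardcoded per-schema ORM imports.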
Categories

  • Scope: What's in/out of the feature boundary
  • Constraints: Technical, performance, or compatibility limits
  • Edge Cases: Unusual inputs or states that need handling
  • Priorities: What to optimize for when trade-offs arise
  • Dependencies: External systems, libraries, or features required

Instructions for Human

  • Answer any questions where the default assumption does not match your intent
  • Leave blank to accept the default assumption
  • Type "skip" to skip all questions and proceed with all defaults

Instructions for Researcher

Skip-if-covered rule: Before generating a question, check the user interview context passed in the prompt. If the user interview already answers the question (even partially), do not include it. Only generate questions for genuine unknowns that emerged from codebase research.

Citation rule: Each question must reference a specific finding from your research (e.g., "Research found 3 different auth patterns in the codebase" or "The existing API uses X but the spec implies Y"). Questions without research backing should be dropped -- they are likely obvious or inferable.

Zero-questions path: If all potential questions are covered by the user interview or are inferable from the codebase, do not create this file. The pipeline will proceed without it (fast-approve path).