Verification Specification
Feature: F006
Generated from: specs/F006-VERIFICATION_INPUT.json
Generated: 2026-03-27
1. Unit Tests
1.1 GRPOConfig
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_grpo_config_defaults | All defaults are populated when only required fields given | GRPOConfig(questions_path="q.json", db_dir="dbs/", output_dir="out/") | max_new_tokens=256, num_train_epochs=1, per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=5e-6, num_generations=4, step_budget=10, difficulty_filter=["easy","medium"], seed=42, logging_steps=10, model_name="Qwen/Qwen3-1.7B" | happy |
| test_grpo_config_custom_values | Custom values override defaults | GRPOConfig(model_name="gpt2", max_new_tokens=128, ...) | Fields match custom values | happy |
| test_grpo_config_required_fields | Missing required fields raise error | GRPOConfig() (no questions_path, db_dir, output_dir) | TypeError or validation error | error |
| test_grpo_config_negative_batch_size | Negative or zero batch size | per_device_train_batch_size=0 | Validation error or clear failure at training time | edge |
| test_grpo_config_negative_learning_rate | Negative learning rate | learning_rate=-1.0 | Validation error | edge |
| test_grpo_config_empty_difficulty_filter | Empty difficulty filter list | difficulty_filter=[] | Empty training set or clear error | edge |
| test_grpo_config_seed_reproducibility | Same seed produces same config state | seed=42 twice | Identical configs | happy |
Run: uv run pytest tests/unit/test_grpo_config.py -v
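The defaults and edge cases above can be sketched as a dataclass. The field names and default values come from the table; the `__post_init__` checks are an assumption about how the fail-fast validation might be enforced, not the project's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class GRPOConfig:
    # Required fields (no defaults) -- omitting any of them raises TypeError.
    questions_path: str
    db_dir: str
    output_dir: str
    # Defaults taken from the table above.
    model_name: str = "Qwen/Qwen3-1.7B"
    max_new_tokens: int = 256
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-6
    num_generations: int = 4
    step_budget: int = 10
    difficulty_filter: list = field(default_factory=lambda: ["easy", "medium"])
    seed: int = 42
    logging_steps: int = 10

    def __post_init__(self):
        # Assumed fail-fast validation for the edge-case tests above.
        if self.per_device_train_batch_size <= 0:
            raise ValueError("per_device_train_batch_size must be positive")
        if self.learning_rate < 0:
            raise ValueError("learning_rate must be non-negative")
```

Note that learning_rate=0.0 is deliberately allowed here, matching the edge-cases checklist (zero learning rate means no weight updates, not a configuration error).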
1.2 get_system_prompt (training/prompts.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_system_prompt_returns_string | Function returns non-empty string | None | isinstance(result, str) and len(result) > 0 | happy |
| test_system_prompt_mentions_action_types | Prompt documents all four action types | None | Result contains "DESCRIBE", "SAMPLE", "QUERY", "ANSWER" | happy |
| test_system_prompt_is_deterministic | Multiple calls return identical string | None | get_system_prompt() == get_system_prompt() | happy |
Run: uv run pytest tests/unit/test_prompts.py -v
1.3 format_observation (training/prompts.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_format_observation_happy | Formats a normal observation into user-turn string | SQLObservation(question="Q?", schema_info="tables", result="25", error="", step_count=1, budget_remaining=9, action_history=["QUERY"], done=False, reward=None) | Non-empty string containing question, result, and budget info | happy |
| test_format_observation_with_error | Error field is surfaced in formatted string | SQLObservation(..., error="syntax error", result="") | String contains "syntax error" or error indication | happy |
| test_format_observation_done_state | Terminal observation is properly formatted | SQLObservation(..., done=True, reward=1.0) | String includes reward/done indication | happy |
| test_format_observation_empty_result | Empty result is handled gracefully | SQLObservation(..., result="", error="") | Returns valid string without crashing | edge |
| test_format_observation_long_result | Very long result string | SQLObservation(..., result="x" * 10000) | Returns string (may be truncated); no crash | edge |
Run: uv run pytest tests/unit/test_prompts.py -v
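A minimal sketch of what `format_observation` might look like, using the `SQLObservation` fields from the table. The exact wording of the formatted string and the truncation limit are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SQLObservation:
    question: str
    schema_info: str
    result: str
    error: str
    step_count: int
    budget_remaining: int
    action_history: List[str]
    done: bool
    reward: Optional[float]

MAX_RESULT_CHARS = 2000  # assumed truncation limit, not from the spec

def format_observation(obs: SQLObservation) -> str:
    lines = [f"Question: {obs.question}"]
    if obs.error:
        lines.append(f"Error: {obs.error}")
    else:
        # Truncate very long results so the formatted string stays bounded.
        lines.append(f"Result: {obs.result[:MAX_RESULT_CHARS]}")
    lines.append(f"Steps used: {obs.step_count}, budget remaining: {obs.budget_remaining}")
    if obs.done:
        lines.append(f"Episode finished. Reward: {obs.reward}")
    return "\n".join(lines)
```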
1.4 parse_model_output (training/rollout.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_parse_describe | Parses DESCRIBE action | "DESCRIBE employees" | SQLAction(action_type="DESCRIBE", argument="employees") | happy |
| test_parse_sample | Parses SAMPLE action | "SAMPLE departments" | SQLAction(action_type="SAMPLE", argument="departments") | happy |
| test_parse_query | Parses QUERY action | "QUERY SELECT COUNT(*) FROM employees" | SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employees") | happy |
| test_parse_answer | Parses ANSWER action | "ANSWER 42" | SQLAction(action_type="ANSWER", argument="42") | happy |
| test_parse_case_insensitive | Case variations accepted | "describe employees" or "Describe employees" | Valid SQLAction with action_type="DESCRIBE" | edge |
| test_parse_with_colon_separator | Colon-separated format | "QUERY: SELECT 1" | SQLAction(action_type="QUERY", argument="SELECT 1") | edge |
| test_parse_garbage_fallback | Unparseable text falls back to QUERY | "hello world random text" | SQLAction(action_type="QUERY", argument="hello world random text") | error |
| test_parse_empty_string_fallback | Empty string falls back to QUERY | "" | SQLAction(action_type="QUERY", argument="") | edge |
| test_parse_only_action_no_argument | Action keyword with no argument | "DESCRIBE" | Fallback or empty argument handled gracefully | edge |
| test_parse_multiline_output | Model output with multiple lines | "Let me think...\nQUERY SELECT 1" | Extracts QUERY action or falls back to QUERY with raw text | edge |
| test_parse_whitespace_padded | Leading/trailing whitespace | " ANSWER 42 " | SQLAction(action_type="ANSWER", argument="42") | edge |
Run: uv run pytest tests/unit/test_rollout.py -v
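The parsing rules above (case-insensitivity, optional colon separator, extraction from multiline output, and the QUERY fallback) can be captured with a single regex. This is a sketch consistent with the table, not the project's actual parser; in particular, capturing everything after the keyword (including later lines) is an assumption.

```python
import re
from dataclasses import dataclass

@dataclass
class SQLAction:
    action_type: str
    argument: str

# Matches "QUERY ..." or "QUERY: ..." case-insensitively, anywhere in the
# text; DOTALL lets the argument span the remainder of a multiline output.
_ACTION_RE = re.compile(
    r"\b(DESCRIBE|SAMPLE|QUERY|ANSWER)\b\s*:?\s*(.*)",
    re.IGNORECASE | re.DOTALL,
)

def parse_model_output(text: str) -> SQLAction:
    m = _ACTION_RE.search(text)
    if m:
        return SQLAction(m.group(1).upper(), m.group(2).strip())
    # Fallback per the spec: unparseable output becomes a raw QUERY,
    # which the environment will then reject with an error observation.
    return SQLAction("QUERY", text.strip())
```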
1.5 reward_correctness (training/rewards.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_correctness_correct_answer | Episode ended with correct answer | Completions with correct=True metadata | [1.0] | happy |
| test_correctness_wrong_answer | Episode ended with wrong answer | Completions with correct=False metadata | [0.0] | happy |
| test_correctness_no_answer | Episode timed out without answering | Completions with no answer metadata | [0.0] | edge |
| test_correctness_batch | Multiple episodes in batch | Mixed correct/wrong | [1.0, 0.0, 1.0, 0.0] matching per-episode correctness | happy |
| test_correctness_empty_batch | Empty completions list | [] | [] | edge |
| test_correctness_trl_compatible | Return type is list[float] | Any valid input | all(isinstance(r, float) for r in result) | happy |
Run: uv run pytest tests/unit/test_rewards.py -v
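A sketch of a TRL-compatible `reward_correctness`, assuming each completion is a dict whose `correct` flag was set by `rollout_func`; a missing flag (episode timed out without answering) counts as wrong. The dict shape is an assumption based on the metadata keys listed in section 1.8.

```python
def reward_correctness(completions, **kwargs):
    """Return 1.0 per episode that ended with a correct ANSWER, else 0.0."""
    return [1.0 if c.get("correct") else 0.0 for c in completions]
```

An empty batch naturally maps to an empty list, and every element is a Python float, which covers the TRL-compatibility test above.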
1.6 reward_progress (training/rewards.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_progress_full | Maximum progress (correct answer) | Completions with full progress metadata | Reward in [0.0, 1.0], close to 1.0 | happy |
| test_progress_none | No progress toward answer | Completions with zero progress | [0.0] | happy |
| test_progress_partial | Partial progress | Completions with partial closeness | Reward in (0.0, 1.0) exclusive | happy |
| test_progress_normalized | Output is always in [0, 1] range | Various inputs | all(0.0 <= r <= 1.0 for r in result) | happy |
| test_progress_batch | Batch of varied progress | Multiple episodes | List of floats, length matches input | happy |
| test_progress_trl_compatible | Return type is list[float] | Any valid input | all(isinstance(r, float) for r in result) | happy |
Run: uv run pytest tests/unit/test_rewards.py -v
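`reward_progress` might look like the following, assuming each completion carries a pre-computed `progress` closeness score (the key name is an assumption from section 1.8); the clamp is what guarantees the normalization property tested above even for out-of-range inputs.

```python
def reward_progress(completions, **kwargs):
    """Clamp each episode's progress score into [0.0, 1.0]."""
    return [
        min(1.0, max(0.0, float(c.get("progress", 0.0))))
        for c in completions
    ]
```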
1.7 reward_operational (training/rewards.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_operational_good_episode | All steps execute OK, discover new info, no repeats | Completions with exec_ok=True, new_info=True per step | Positive reward | happy |
| test_operational_all_errors | Every step has execution errors | Completions with exec_ok=False per step | Low/negative reward | error |
| test_operational_repeat_penalty | Episode with repeated identical actions | Completions with repeat=True per step | Lower reward than non-repeating | happy |
| test_operational_mixed_signals | Mix of good and bad steps | Varied step signals | Reward between extremes | happy |
| test_operational_single_step | Episode with only one step | Single step completions | Valid float returned | edge |
| test_operational_batch | Multiple episodes | Batch input | List of floats, length matches | happy |
| test_operational_trl_compatible | Return type is list[float] | Any valid input | all(isinstance(r, float) for r in result) | happy |
Run: uv run pytest tests/unit/test_rewards.py -v
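A sketch of `reward_operational`, assuming per-step `exec_ok` / `new_info` / `repeat` booleans stored under a `steps` key (the key name and the weights are illustrative, not prescribed by the spec). Averaging over steps keeps episode length from dominating the score.

```python
def reward_operational(completions, **kwargs):
    """Score execution quality: reward clean, informative, non-repeating steps."""
    rewards = []
    for c in completions:
        steps = c.get("steps", [])
        if not steps:
            rewards.append(0.0)
            continue
        score = 0.0
        for s in steps:
            score += 0.5 if s.get("exec_ok") else -0.5   # execution errors hurt
            score += 0.3 if s.get("new_info") else 0.0   # discovering new info helps
            score -= 0.4 if s.get("repeat") else 0.0     # repeated actions are penalized
        rewards.append(score / len(steps))  # average, so length doesn't dominate
    return rewards
```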
1.8 rollout_func (training/rollout.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_rollout_returns_completions | Returns list of dicts with expected keys | Single prompt, mock model/tokenizer | List of dicts with "content" and metadata keys | happy |
| test_rollout_batch_size | Output length matches input prompt count | N prompts | N completions returned | happy |
| test_rollout_episode_terminates | Episodes terminate within step_budget | Config with step_budget=5 | All episodes have <= 5 steps | happy |
| test_rollout_metadata_present | Completions include correctness, progress, operational metadata | Any valid input | Each completion dict has "correct", "progress", "operational" keys | happy |
| test_rollout_unparseable_action | Model generates gibberish, fallback fires | Mock model returning garbage tokens | Episode continues; no crash | error |
| test_rollout_truncation | Long history is truncated to system + last 3 pairs | Mock model, config with step_budget=20 | Context does not exceed token window | edge |
Run: uv run pytest tests/unit/test_rollout.py -v
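The truncation rule in test_rollout_truncation (system prompt plus the last three observation-action pairs) reduces to a short helper. The chat-message layout assumed here is `[system, user, assistant, user, assistant, ...]`; the helper name is hypothetical.

```python
def truncate_history(messages, keep_pairs=3):
    """Keep the system message plus the last `keep_pairs` user/assistant pairs.

    messages: [system, user, assistant, user, assistant, ...]
    """
    system, rest = messages[:1], messages[1:]
    return system + rest[-2 * keep_pairs:]
```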
2. Integration Tests
Flow: End-to-End Training Episode
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Create GRPOConfig with test questions and mock DB | Config object created | Config fields match inputs |
| 2 | Load questions and filter by difficulty | Only easy+medium questions included | Assert filtered count < total if hard questions exist |
| 3 | Call rollout_func with a real SQLEnvironment and mock model | Completions returned with metadata | Each completion has "content" key |
| 4 | Pass completions to reward_correctness | Returns list[float] of 0.0/1.0 | Length matches batch size |
| 5 | Pass completions to reward_progress | Returns list[float] in [0,1] | Length matches batch size |
| 6 | Pass completions to reward_operational | Returns list[float] | Length matches batch size |
Run: uv run pytest tests/integration/test_training_pipeline.py -v
Flow: Unparseable Action Recovery
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Mock model generates unparseable text | parse_model_output returns QUERY fallback | action_type == "QUERY", argument == raw text |
| 2 | SQLEnvironment.step receives fallback action | Returns error observation | observation.error is non-empty |
| 3 | Episode continues with next step | Step count increments, budget decreases | step_count > previous, budget_remaining < previous |
Run: uv run pytest tests/integration/test_training_pipeline.py -v
Flow: History Truncation
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Run rollout with step_budget large enough to exceed token window | Truncation is triggered | History contains system prompt + last 3 observation-action pairs only |
| 2 | Episode completes normally after truncation | No crash; completions returned | Valid completion dicts in output |
Run: uv run pytest tests/integration/test_training_pipeline.py -v
3. API Tests
No API endpoints defined for F006. All interfaces are Python function calls.
4. E2E Tests
Scenario: Training Notebook Smoke Test
Setup: Test questions JSON with 2 easy questions, a test SQLite database, and a tiny model (or mock).
Actions:
- Instantiate GRPOConfig with test paths and minimal hyperparameters (1 epoch, batch_size=1, num_generations=2).
- Load model and tokenizer (use smallest available model or mock).
- Create GRPOTrainer with reward functions.
- Run trainer.train() for a single step.
- Verify learning curve data is logged.
- Run comparison episodes (before/after).
Expected:
- Training completes without error.
- At least one metric is logged (loss, reward).
- Comparison episodes produce valid SQLObservation sequences.
Run: uv run pytest tests/e2e/test_training_e2e.py -v --timeout=300
Scenario: Question Filtering by Difficulty
Setup: Questions file with easy, medium, and hard questions.
Actions:
- Create GRPOConfig with difficulty_filter=["easy"].
- Load and filter questions.
Expected: Only easy questions are included in training set.
Run: uv run pytest tests/e2e/test_training_e2e.py -v
5. Error Handling Tests
ModelLoadError
| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_model_load_error_bad_name | Invalid HuggingFace model name | GRPOConfig(model_name="nonexistent/model-xyz-999") | Fails fast; error message contains "nonexistent/model-xyz-999" |
ActionParseError (handled via fallback)
| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_action_parse_fallback_logged | Unparseable action triggers warning log | Model outputs "¯\_(ツ)_/¯" | Warning logged; returns QUERY fallback |
QuestionLoadError
| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_question_load_missing_file | Questions path does not exist | GRPOConfig(questions_path="/nonexistent/q.json") | Fails fast; error message contains the path |
| test_question_load_empty_file | Questions file is empty JSON array | questions.json containing [] | Fails fast; clear error about empty questions |
| test_question_load_invalid_json | Questions file has invalid JSON | questions.json containing {broken | Fails fast; JSON parse error |
OOMError
| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_oom_guidance | OOM during training prints guidance | (Cannot reliably trigger in test; verify message formatting only) | Error handler message mentions reducing batch_size or num_generations |
Run: uv run pytest tests/unit/test_error_handling.py -v
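The QuestionLoadError tests imply fail-fast loading along these lines. The `load_questions` helper, its error types, and the `difficulty` key are assumptions for illustration; the spec only requires that each failure surfaces quickly with a clear message containing the offending path.

```python
import json
from pathlib import Path

def load_questions(questions_path, difficulty_filter):
    """Load and filter questions, failing fast on missing/empty/invalid files."""
    path = Path(questions_path)
    if not path.exists():
        # Message includes the path, per test_question_load_missing_file.
        raise FileNotFoundError(f"Questions file not found: {questions_path}")
    try:
        questions = json.loads(path.read_text())
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in {questions_path}: {e}") from e
    if not questions:
        raise ValueError(f"No questions found in {questions_path}")
    return [q for q in questions if q.get("difficulty") in difficulty_filter]
```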
6. Edge Cases Checklist
- Null/None inputs to parse_model_output
- Empty string inputs to parse_model_output
- Empty completions list to all reward functions
- Single-element completions list to all reward functions
- Very large batch (100+ prompts) to rollout_func
- Questions file with only hard questions and difficulty_filter=["easy"] (zero matches)
- step_budget=1 (immediate budget exhaustion after one action)
- step_budget=0 (zero budget)
- Unicode characters in model output (e.g., CJK, emoji)
- Model output exceeding max_new_tokens
- learning_rate=0.0 (no weight updates)
- num_generations=1 (minimum GRPO completions)
- Concurrent calls to reward functions (thread safety)
- Database with no tables (empty schema)
- Database with very large tables (performance)
7. Evidence Requirements
| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | X passed |
| Integration | pytest output | X passed |
| Error handling | pytest output | X passed |
| E2E | pytest output + training metrics | 1 passed, loss=X.XX |
| Reward functions | pytest output showing correct values | reward_correctness: [1.0, 0.0] |
| Parse fallback | pytest output + log capture | WARNING: unparseable action, falling back to QUERY |