
Verification Specification

Feature: F006
Generated from: specs/F006-VERIFICATION_INPUT.json
Generated: 2026-03-27


1. Unit Tests

1.1 GRPOConfig

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_grpo_config_defaults | All defaults are populated when only required fields are given | GRPOConfig(questions_path="q.json", db_dir="dbs/", output_dir="out/") | max_new_tokens=256, num_train_epochs=1, per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=5e-6, num_generations=4, step_budget=10, difficulty_filter=["easy","medium"], seed=42, logging_steps=10, model_name="Qwen/Qwen3-1.7B" | happy |
| test_grpo_config_custom_values | Custom values override defaults | GRPOConfig(model_name="gpt2", max_new_tokens=128, ...) | Fields match custom values | happy |
| test_grpo_config_required_fields | Missing required fields raise an error | GRPOConfig() (no questions_path, db_dir, output_dir) | TypeError or validation error | error |
| test_grpo_config_negative_batch_size | Negative or zero batch size | per_device_train_batch_size=0 | Validation error or clear failure at training time | edge |
| test_grpo_config_negative_learning_rate | Negative learning rate | learning_rate=-1.0 | Validation error | edge |
| test_grpo_config_empty_difficulty_filter | Empty difficulty filter list | difficulty_filter=[] | Empty training set or clear error | edge |
| test_grpo_config_seed_reproducibility | Same seed produces the same config state | seed=42 twice | Identical configs | happy |

Run: uv run pytest tests/unit/test_grpo_config.py -v
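The defaults row above can be sketched as a minimal stand-in dataclass; this is an illustrative assumption about how `GRPOConfig` is structured, not the real class from the repository:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for GRPOConfig; field names and defaults
# mirror the expectations listed in the table above.
@dataclass
class GRPOConfig:
    questions_path: str                 # required, no default
    db_dir: str                         # required, no default
    output_dir: str                     # required, no default
    model_name: str = "Qwen/Qwen3-1.7B"
    max_new_tokens: int = 256
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-6
    num_generations: int = 4
    step_budget: int = 10
    difficulty_filter: list = field(default_factory=lambda: ["easy", "medium"])
    seed: int = 42
    logging_steps: int = 10

# test_grpo_config_defaults: only required fields given, defaults populate.
cfg = GRPOConfig(questions_path="q.json", db_dir="dbs/", output_dir="out/")
assert cfg.max_new_tokens == 256
assert cfg.difficulty_filter == ["easy", "medium"]
```

With a plain dataclass, `test_grpo_config_required_fields` falls out of Python itself: calling `GRPOConfig()` with no arguments raises `TypeError`.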


1.2 get_system_prompt (training/prompts.py)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_system_prompt_returns_string | Function returns a non-empty string | None | isinstance(result, str) and len(result) > 0 | happy |
| test_system_prompt_mentions_action_types | Prompt documents all four action types | None | Result contains "DESCRIBE", "SAMPLE", "QUERY", "ANSWER" | happy |
| test_system_prompt_is_deterministic | Multiple calls return an identical string | None | get_system_prompt() == get_system_prompt() | happy |

Run: uv run pytest tests/unit/test_prompts.py -v
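A minimal sketch of the contract these three tests pin down, using a hypothetical prompt body (the real wording lives in training/prompts.py):

```python
# Hypothetical get_system_prompt; only the contract matters here:
# non-empty, mentions all four action types, deterministic.
def get_system_prompt() -> str:
    return (
        "You interact with a SQL database. On each turn emit exactly one "
        "action: DESCRIBE <table>, SAMPLE <table>, QUERY <sql>, or ANSWER <value>."
    )

prompt = get_system_prompt()
assert isinstance(prompt, str) and len(prompt) > 0
assert all(a in prompt for a in ("DESCRIBE", "SAMPLE", "QUERY", "ANSWER"))
assert get_system_prompt() == get_system_prompt()  # deterministic
```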


1.3 format_observation (training/prompts.py)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_format_observation_happy | Formats a normal observation into a user-turn string | SQLObservation(question="Q?", schema_info="tables", result="25", error="", step_count=1, budget_remaining=9, action_history=["QUERY"], done=False, reward=None) | Non-empty string containing question, result, and budget info | happy |
| test_format_observation_with_error | Error field is surfaced in the formatted string | SQLObservation(..., error="syntax error", result="") | String contains "syntax error" or error indication | happy |
| test_format_observation_done_state | Terminal observation is properly formatted | SQLObservation(..., done=True, reward=1.0) | String includes reward/done indication | happy |
| test_format_observation_empty_result | Empty result is handled gracefully | SQLObservation(..., result="", error="") | Returns valid string without crashing | edge |
| test_format_observation_long_result | Very long result string | SQLObservation(..., result="x" * 10000) | Returns string (may be truncated); no crash | edge |

Run: uv run pytest tests/unit/test_prompts.py -v
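The table's cases can be exercised against a stand-in observation type and formatter; field names follow the table, but the formatting logic and the `MAX_RESULT_CHARS` limit are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

# Stand-ins for the real SQLObservation / format_observation in
# training/prompts.py; field names follow the table above.
@dataclass
class SQLObservation:
    question: str
    schema_info: str = ""
    result: str = ""
    error: str = ""
    step_count: int = 0
    budget_remaining: int = 0
    action_history: list = field(default_factory=list)
    done: bool = False
    reward: Optional[float] = None

MAX_RESULT_CHARS = 2000  # assumed truncation limit

def format_observation(obs: SQLObservation) -> str:
    parts = [f"Question: {obs.question}"]
    if obs.error:
        parts.append(f"Error: {obs.error}")          # surface errors verbatim
    else:
        parts.append(f"Result: {obs.result[:MAX_RESULT_CHARS]}")
    parts.append(f"Steps used: {obs.step_count}, budget remaining: {obs.budget_remaining}")
    if obs.done:
        parts.append(f"Episode done. Reward: {obs.reward}")
    return "\n".join(parts)

# test_format_observation_long_result: truncated, never crashes.
long_obs = SQLObservation(question="Q?", result="x" * 10000, budget_remaining=9)
assert len(format_observation(long_obs)) < 10000
```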


1.4 parse_model_output (training/rollout.py)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_parse_describe | Parses DESCRIBE action | "DESCRIBE employees" | SQLAction(action_type="DESCRIBE", argument="employees") | happy |
| test_parse_sample | Parses SAMPLE action | "SAMPLE departments" | SQLAction(action_type="SAMPLE", argument="departments") | happy |
| test_parse_query | Parses QUERY action | "QUERY SELECT COUNT(*) FROM employees" | SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employees") | happy |
| test_parse_answer | Parses ANSWER action | "ANSWER 42" | SQLAction(action_type="ANSWER", argument="42") | happy |
| test_parse_case_insensitive | Case variations accepted | "describe employees" or "Describe employees" | Valid SQLAction with action_type="DESCRIBE" | edge |
| test_parse_with_colon_separator | Colon-separated format | "QUERY: SELECT 1" | SQLAction(action_type="QUERY", argument="SELECT 1") | edge |
| test_parse_garbage_fallback | Unparseable text falls back to QUERY | "hello world random text" | SQLAction(action_type="QUERY", argument="hello world random text") | error |
| test_parse_empty_string_fallback | Empty string falls back to QUERY | "" | SQLAction(action_type="QUERY", argument="") | edge |
| test_parse_only_action_no_argument | Action keyword with no argument | "DESCRIBE" | Fallback or empty argument handled gracefully | edge |
| test_parse_multiline_output | Model output with multiple lines | "Let me think...\nQUERY SELECT 1" | Extracts QUERY action or falls back to QUERY with raw text | edge |
| test_parse_whitespace_padded | Leading/trailing whitespace | " ANSWER 42 " | SQLAction(action_type="ANSWER", argument="42") | edge |

Run: uv run pytest tests/unit/test_rollout.py -v
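One parser consistent with every row above (case-insensitive keyword, optional colon, line-by-line scan for multiline output, QUERY fallback for garbage) can be sketched as follows; the real implementation in training/rollout.py may differ:

```python
import re
from dataclasses import dataclass

@dataclass
class SQLAction:
    action_type: str
    argument: str

# Hypothetical parser matching the behavior the table specifies.
def parse_model_output(text: str) -> SQLAction:
    for line in text.strip().splitlines():
        m = re.match(
            r"^\s*(DESCRIBE|SAMPLE|QUERY|ANSWER)\s*:?\s*(.*)$",
            line.strip(),
            flags=re.IGNORECASE,
        )
        if m:
            return SQLAction(m.group(1).upper(), m.group(2).strip())
    # Unparseable output falls back to treating the raw text as a QUERY.
    return SQLAction("QUERY", text.strip())

assert parse_model_output("describe employees").action_type == "DESCRIBE"
assert parse_model_output("QUERY: SELECT 1").argument == "SELECT 1"
assert parse_model_output("hello world random text").action_type == "QUERY"
assert parse_model_output("Let me think...\nQUERY SELECT 1").argument == "SELECT 1"
```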


1.5 reward_correctness (training/rewards.py)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_correctness_correct_answer | Episode ended with correct answer | Completions with correct=True metadata | [1.0] | happy |
| test_correctness_wrong_answer | Episode ended with wrong answer | Completions with correct=False metadata | [0.0] | happy |
| test_correctness_no_answer | Episode timed out without answering | Completions with no answer metadata | [0.0] | edge |
| test_correctness_batch | Multiple episodes in batch | Mixed correct/wrong | [1.0, 0.0, 1.0, 0.0] matching per-episode correctness | happy |
| test_correctness_empty_batch | Empty completions list | [] | [] | edge |
| test_correctness_trl_compatible | Return type is list[float] | Any valid input | all(isinstance(r, float) for r in result) | happy |

Run: uv run pytest tests/unit/test_rewards.py -v
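All six rows reduce to a one-line TRL-style reward function; the metadata key `correct` is assumed here for illustration:

```python
# Hypothetical reward_correctness: 1.0 iff the episode answered correctly;
# missing metadata (timeout, no answer) counts as 0.0.
def reward_correctness(completions, **kwargs):
    return [1.0 if c.get("correct") else 0.0 for c in completions]

batch = [
    {"content": "...", "correct": True},
    {"content": "...", "correct": False},
    {"content": "..."},  # timed out: no answer metadata
]
rewards = reward_correctness(batch)
assert rewards == [1.0, 0.0, 0.0]
assert all(isinstance(r, float) for r in rewards)  # TRL-compatible
assert reward_correctness([]) == []                # empty batch
```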


1.6 reward_progress (training/rewards.py)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_progress_full | Maximum progress (correct answer) | Completions with full progress metadata | Reward in [0.0, 1.0], close to 1.0 | happy |
| test_progress_none | No progress toward answer | Completions with zero progress | [0.0] | happy |
| test_progress_partial | Partial progress | Completions with partial closeness | Reward in (0.0, 1.0) exclusive | happy |
| test_progress_normalized | Output is always in [0, 1] range | Various inputs | all(0.0 <= r <= 1.0 for r in result) | happy |
| test_progress_batch | Batch of varied progress | Multiple episodes | List of floats, length matches input | happy |
| test_progress_trl_compatible | Return type is list[float] | Any valid input | all(isinstance(r, float) for r in result) | happy |

Run: uv run pytest tests/unit/test_rewards.py -v
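The normalization guarantee (`test_progress_normalized`) is the interesting invariant; a sketch that clamps an assumed raw `progress` metadata value into [0, 1]:

```python
# Hypothetical reward_progress: clamp a raw closeness score into [0, 1]
# so the reward is normalized regardless of what the rollout recorded.
def reward_progress(completions, **kwargs):
    return [
        min(1.0, max(0.0, float(c.get("progress", 0.0))))
        for c in completions
    ]

batch = [{"progress": 0.95}, {"progress": 0.0}, {"progress": 1.7}]
rewards = reward_progress(batch)
assert all(0.0 <= r <= 1.0 for r in rewards)   # normalized even for 1.7
assert len(rewards) == len(batch)
assert all(isinstance(r, float) for r in rewards)
```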


1.7 reward_operational (training/rewards.py)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_operational_good_episode | All steps execute OK, discover new info, no repeats | Completions with exec_ok=True, new_info=True per step | Positive reward | happy |
| test_operational_all_errors | Every step has execution errors | Completions with exec_ok=False per step | Low/negative reward | error |
| test_operational_repeat_penalty | Episode with repeated identical actions | Completions with repeat=True per step | Lower reward than non-repeating | happy |
| test_operational_mixed_signals | Mix of good and bad steps | Varied step signals | Reward between extremes | happy |
| test_operational_single_step | Episode with only one step | Single step completions | Valid float returned | edge |
| test_operational_batch | Multiple episodes | Batch input | List of floats, length matches | happy |
| test_operational_trl_compatible | Return type is list[float] | Any valid input | all(isinstance(r, float) for r in result) | happy |

Run: uv run pytest tests/unit/test_rewards.py -v
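One scoring scheme consistent with the table (reward successful novel steps, penalize errors and repeats, average per episode); the specific weights are illustrative assumptions, not the repository's actual constants:

```python
# Hypothetical reward_operational: +1 per successful step that found new
# info, -1 per execution error, -0.5 per repeated action, mean per episode.
def reward_operational(completions, **kwargs):
    rewards = []
    for c in completions:
        steps = c.get("steps", [])
        if not steps:
            rewards.append(0.0)
            continue
        score = 0.0
        for s in steps:
            if s.get("exec_ok") and s.get("new_info"):
                score += 1.0
            if not s.get("exec_ok"):
                score -= 1.0
            if s.get("repeat"):
                score -= 0.5
        rewards.append(score / len(steps))
    return rewards

good = {"steps": [{"exec_ok": True, "new_info": True}] * 3}
bad = {"steps": [{"exec_ok": False}] * 3}
r = reward_operational([good, bad])
assert r[0] > 0 > r[1]   # good episode positive, all-errors episode negative
```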


1.8 rollout_func (training/rollout.py)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_rollout_returns_completions | Returns a list of dicts with expected keys | Single prompt, mock model/tokenizer | List of dicts with "content" and metadata keys | happy |
| test_rollout_batch_size | Output length matches input prompt count | N prompts | N completions returned | happy |
| test_rollout_episode_terminates | Episodes terminate within step_budget | Config with step_budget=5 | All episodes have <= 5 steps | happy |
| test_rollout_metadata_present | Completions include correctness, progress, and operational metadata | Any valid input | Each completion dict has "correct", "progress", "operational" keys | happy |
| test_rollout_unparseable_action | Model generates gibberish; fallback fires | Mock model returning garbage tokens | Episode continues; no crash | error |
| test_rollout_truncation | Long history is truncated to system prompt + last 3 pairs | Mock model, config with step_budget=20 | Context does not exceed token window | edge |

Run: uv run pytest tests/unit/test_rollout.py -v
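The contract these tests pin down, stripped of the model and environment, can be sketched as a skeletal episode loop with a stub model; the signature and metadata keys here are assumptions mirroring the table:

```python
# Skeletal rollout loop illustrating the tested contract: one completion
# dict per prompt, termination within step_budget, metadata keys present.
def rollout_func(prompts, model, step_budget=5):
    completions = []
    for prompt in prompts:
        history, steps = [prompt], 0
        while steps < step_budget:
            action = model(history)   # stub: returns an action string
            history.append(action)
            steps += 1
            if action.startswith("ANSWER"):
                break                 # episode ends on an answer
        completions.append({
            "content": "\n".join(history),
            "correct": False,         # filled from the environment in reality
            "progress": 0.0,
            "operational": 0.0,
            "num_steps": steps,
        })
    return completions

stub_model = lambda history: "QUERY SELECT 1"
out = rollout_func(["prompt-1", "prompt-2"], stub_model, step_budget=5)
assert len(out) == 2                                    # batch size preserved
assert all(c["num_steps"] <= 5 for c in out)            # budget respected
assert all({"content", "correct", "progress", "operational"} <= c.keys()
           for c in out)                                # metadata present
```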


2. Integration Tests

Flow: End-to-End Training Episode

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Create GRPOConfig with test questions and a mock DB | Config object created | Config fields match inputs |
| 2 | Load questions and filter by difficulty | Only easy+medium questions included | Assert filtered count < total if hard questions exist |
| 3 | Call rollout_func with a real SQLEnvironment and a mock model | Completions returned with metadata | Each completion has "content" key |
| 4 | Pass completions to reward_correctness | Returns list[float] of 0.0/1.0 | Length matches batch size |
| 5 | Pass completions to reward_progress | Returns list[float] in [0,1] | Length matches batch size |
| 6 | Pass completions to reward_operational | Returns list[float] | Length matches batch size |

Run: uv run pytest tests/integration/test_training_pipeline.py -v


Flow: Unparseable Action Recovery

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Mock model generates unparseable text | parse_model_output returns QUERY fallback | action_type == "QUERY", argument == raw text |
| 2 | SQLEnvironment.step receives the fallback action | Returns an error observation | observation.error is non-empty |
| 3 | Episode continues with the next step | Step count increments, budget decreases | step_count > previous, budget_remaining < previous |

Run: uv run pytest tests/integration/test_training_pipeline.py -v


Flow: History Truncation

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Run rollout with step_budget large enough to exceed the token window | Truncation is triggered | History contains system prompt + last 3 observation-action pairs only |
| 2 | Episode completes normally after truncation | No crash; completions returned | Valid completion dicts in output |

Run: uv run pytest tests/integration/test_training_pipeline.py -v
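The "system prompt + last 3 pairs" policy from step 1 can be sketched as a small helper; the list-of-turns representation is an assumption for illustration:

```python
# Hypothetical truncation policy: keep the system prompt plus only the
# last `keep_pairs` observation-action pairs (2 turns per pair).
def truncate_history(history, keep_pairs=3):
    system, *turns = history
    return [system] + turns[-2 * keep_pairs:]

history = ["SYSTEM"] + [f"turn-{i}" for i in range(20)]
truncated = truncate_history(history)
assert truncated[0] == "SYSTEM"        # system prompt always survives
assert len(truncated) == 1 + 6         # 3 pairs = 6 turns
assert truncated[-1] == "turn-19"      # most recent turn retained
```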


3. API Tests

No API endpoints defined for F006. All interfaces are Python function calls.


4. E2E Tests

Scenario: Training Notebook Smoke Test

Setup: test questions JSON with 2 easy questions, a test SQLite database, and a tiny model (or mock).

Actions:

  1. Instantiate GRPOConfig with test paths and minimal hyperparameters (1 epoch, batch_size=1, num_generations=2).
  2. Load model and tokenizer (use smallest available model or mock).
  3. Create GRPOTrainer with reward functions.
  4. Run trainer.train() for a single step.
  5. Verify learning curve data is logged.
  6. Run comparison episodes (before/after).

Expected:

  • Training completes without error.
  • At least one metric is logged (loss, reward).
  • Comparison episodes produce valid SQLObservation sequences.

Run: uv run pytest tests/e2e/test_training_e2e.py -v --timeout=300


Scenario: Question Filtering by Difficulty

Setup: questions file with easy, medium, and hard questions.

Actions:

  1. Create GRPOConfig with difficulty_filter=["easy"].
  2. Load and filter questions.

Expected: Only easy questions are included in training set.

Run: uv run pytest tests/e2e/test_training_e2e.py -v
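The filtering step can be sketched end to end against a temporary questions file; `load_questions` and the question schema (`difficulty` key) are illustrative assumptions:

```python
import json
import os
import tempfile

# Hypothetical question loading + difficulty filtering.
def load_questions(path, difficulty_filter):
    with open(path) as f:
        questions = json.load(f)
    return [q for q in questions if q["difficulty"] in difficulty_filter]

questions = [
    {"q": "a", "difficulty": "easy"},
    {"q": "b", "difficulty": "medium"},
    {"q": "c", "difficulty": "hard"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(questions, f)
    path = f.name

# difficulty_filter=["easy"] keeps only the easy question.
easy_only = load_questions(path, ["easy"])
assert len(easy_only) == 1 and easy_only[0]["difficulty"] == "easy"
os.unlink(path)
```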


5. Error Handling Tests

ModelLoadError

| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_model_load_error_bad_name | Invalid HuggingFace model name | GRPOConfig(model_name="nonexistent/model-xyz-999") | Fails fast; error message contains "nonexistent/model-xyz-999" |

ActionParseError (handled via fallback)

| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_action_parse_fallback_logged | Unparseable action triggers a warning log | Model outputs "¯\_(ツ)_/¯" | Warning logged; returns QUERY fallback |

QuestionLoadError

| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_question_load_missing_file | Questions path does not exist | GRPOConfig(questions_path="/nonexistent/q.json") | Fails fast; error message contains the path |
| test_question_load_empty_file | Questions file is an empty JSON array | questions.json containing [] | Fails fast; clear error about empty questions |
| test_question_load_invalid_json | Questions file has invalid JSON | questions.json containing {broken | Fails fast; JSON parse error |

OOMError

| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_oom_guidance | OOM during training prints guidance | (Cannot reliably trigger in test; verify message formatting only) | Error handler message mentions reducing batch_size or num_generations |

Run: uv run pytest tests/unit/test_error_handling.py -v
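A fail-fast loader matching the three QuestionLoadError rows can be sketched as follows; the exception class and messages are illustrative assumptions about how the real code reports these failures:

```python
import json

# Hypothetical fail-fast loader: missing file, invalid JSON, and an
# empty question list all raise QuestionLoadError with the path included.
class QuestionLoadError(Exception):
    pass

def load_questions_strict(path):
    try:
        with open(path) as f:
            questions = json.load(f)
    except FileNotFoundError:
        raise QuestionLoadError(f"Questions file not found: {path}")
    except json.JSONDecodeError as e:
        raise QuestionLoadError(f"Invalid JSON in {path}: {e}")
    if not questions:
        raise QuestionLoadError(f"Questions file is empty: {path}")
    return questions

# test_question_load_missing_file: the error message contains the path.
try:
    load_questions_strict("/nonexistent/q.json")
    raise AssertionError("expected QuestionLoadError")
except QuestionLoadError as e:
    assert "/nonexistent/q.json" in str(e)
```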


6. Edge Cases Checklist

  • Null/None inputs to parse_model_output
  • Empty string inputs to parse_model_output
  • Empty completions list to all reward functions
  • Single-element completions list to all reward functions
  • Very large batch (100+ prompts) to rollout_func
  • Questions file with only hard questions and difficulty_filter=["easy"] (zero matches)
  • step_budget=1 (immediate budget exhaustion after one action)
  • step_budget=0 (zero budget)
  • Unicode characters in model output (e.g., CJK, emoji)
  • Model output exceeding max_new_tokens
  • learning_rate=0.0 (no weight updates)
  • num_generations=1 (minimum GRPO completions)
  • Concurrent calls to reward functions (thread safety)
  • Database with no tables (empty schema)
  • Database with very large tables (performance)

7. Evidence Requirements

| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | X passed |
| Integration | pytest output | X passed |
| Error handling | pytest output | X passed |
| E2E | pytest output + training metrics | 1 passed, loss=X.XX |
| Reward functions | pytest output showing correct values | reward_correctness: [1.0, 0.0] |
| Parse fallback | pytest output + log capture | WARNING: unparseable action, falling back to QUERY |