Verification Specification
Feature: F006
Generated from: specs/F006-VERIFICATION_INPUT.json
Generated: 2026-03-27
1. Unit Tests
1.1 GRPOConfig
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_grpo_config_defaults | All defaults are populated when only required fields given | GRPOConfig(questions_path="q.json", db_dir="dbs/", output_dir="out/") | max_new_tokens=256, num_train_epochs=1, per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=5e-6, num_generations=4, step_budget=10, difficulty_filter=["easy","medium"], seed=42, logging_steps=10, model_name="Qwen/Qwen3-1.7B" | happy |
| test_grpo_config_custom_values | Custom values override defaults | GRPOConfig(model_name="gpt2", max_new_tokens=128, ...) | Fields match custom values | happy |
| test_grpo_config_required_fields | Missing required fields raise error | GRPOConfig() (no questions_path, db_dir, output_dir) | TypeError or validation error | error |
| test_grpo_config_negative_batch_size | Negative or zero batch size | per_device_train_batch_size=0 | Validation error or clear failure at training time | edge |
| test_grpo_config_negative_learning_rate | Negative learning rate | learning_rate=-1.0 | Validation error | edge |
| test_grpo_config_empty_difficulty_filter | Empty difficulty filter list | difficulty_filter=[] | Empty training set or clear error | edge |
| test_grpo_config_seed_reproducibility | Same seed produces same config state | seed=42 twice | Identical configs | happy |
Run: uv run pytest tests/unit/test_grpo_config.py -v
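The defaults and edge cases above can be sketched as a dataclass. The field names and default values come from the table; the `__post_init__` checks are an assumption about how the fail-fast validation might be enforced, not the project's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class GRPOConfig:
    # Required fields (no defaults) -- omitting any of them raises TypeError.
    questions_path: str
    db_dir: str
    output_dir: str
    # Defaults taken from the table above.
    model_name: str = "Qwen/Qwen3-1.7B"
    max_new_tokens: int = 256
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-6
    num_generations: int = 4
    step_budget: int = 10
    difficulty_filter: list = field(default_factory=lambda: ["easy", "medium"])
    seed: int = 42
    logging_steps: int = 10

    def __post_init__(self):
        # Assumed fail-fast validation for the edge-case tests above.
        if self.per_device_train_batch_size <= 0:
            raise ValueError("per_device_train_batch_size must be positive")
        if self.learning_rate < 0:
            raise ValueError("learning_rate must be non-negative")
```

Note that learning_rate=0.0 is deliberately allowed here, matching the edge-cases checklist (zero learning rate means no weight updates, not a configuration error).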
1.2 get_system_prompt (training/prompts.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_system_prompt_returns_string | Function returns non-empty string | None | isinstance(result, str) and len(result) > 0 | happy |
| test_system_prompt_mentions_action_types | Prompt documents all four action types | None | Result contains "DESCRIBE", "SAMPLE", "QUERY", "ANSWER" | happy |
| test_system_prompt_is_deterministic | Multiple calls return identical string | None | get_system_prompt() == get_system_prompt() | happy |
Run: uv run pytest tests/unit/test_prompts.py -v
1.3 format_observation (training/prompts.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_format_observation_happy | Formats a normal observation into user-turn string | SQLObservation(question="Q?", schema_info="tables", result="25", error="", step_count=1, budget_remaining=9, action_history=["QUERY"], done=False, reward=None) | Non-empty string containing question, result, and budget info | happy |
| test_format_observation_with_error | Error field is surfaced in formatted string | SQLObservation(..., error="syntax error", result="") | String contains "syntax error" or error indication | happy |
| test_format_observation_done_state | Terminal observation is properly formatted | SQLObservation(..., done=True, reward=1.0) | String includes reward/done indication | happy |
| test_format_observation_empty_result | Empty result is handled gracefully | SQLObservation(..., result="", error="") | Returns valid string without crashing | edge |
| test_format_observation_long_result | Very long result string | SQLObservation(..., result="x" * 10000) | Returns string (may be truncated); no crash | edge |
Run: uv run pytest tests/unit/test_prompts.py -v
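A minimal sketch of what `format_observation` might look like, using the `SQLObservation` fields from the table. The exact wording of the formatted string and the truncation limit are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SQLObservation:
    question: str
    schema_info: str
    result: str
    error: str
    step_count: int
    budget_remaining: int
    action_history: List[str]
    done: bool
    reward: Optional[float]

MAX_RESULT_CHARS = 2000  # assumed truncation limit, not from the spec

def format_observation(obs: SQLObservation) -> str:
    lines = [f"Question: {obs.question}"]
    if obs.error:
        lines.append(f"Error: {obs.error}")
    else:
        # Truncate very long results so the formatted string stays bounded.
        lines.append(f"Result: {obs.result[:MAX_RESULT_CHARS]}")
    lines.append(f"Steps used: {obs.step_count}, budget remaining: {obs.budget_remaining}")
    if obs.done:
        lines.append(f"Episode finished. Reward: {obs.reward}")
    return "\n".join(lines)
```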
1.4 parse_model_output (training/rollout.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_parse_describe | Parses DESCRIBE action | "DESCRIBE employees" | SQLAction(action_type="DESCRIBE", argument="employees") | happy |
| test_parse_sample | Parses SAMPLE action | "SAMPLE departments" | SQLAction(action_type="SAMPLE", argument="departments") | happy |
| test_parse_query | Parses QUERY action | "QUERY SELECT COUNT(*) FROM employees" | SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employees") | happy |
| test_parse_answer | Parses ANSWER action | "ANSWER 42" | SQLAction(action_type="ANSWER", argument="42") | happy |
| test_parse_case_insensitive | Case variations accepted | "describe employees" or "Describe employees" | Valid SQLAction with action_type="DESCRIBE" | edge |
| test_parse_with_colon_separator | Colon-separated format | "QUERY: SELECT 1" | SQLAction(action_type="QUERY", argument="SELECT 1") | edge |
| test_parse_garbage_fallback | Unparseable text falls back to QUERY | "hello world random text" | SQLAction(action_type="QUERY", argument="hello world random text") | error |
| test_parse_empty_string_fallback | Empty string falls back to QUERY | "" | SQLAction(action_type="QUERY", argument="") | edge |
| test_parse_only_action_no_argument | Action keyword with no argument | "DESCRIBE" | Fallback or empty argument handled gracefully | edge |
| test_parse_multiline_output | Model output with multiple lines | "Let me think...\nQUERY SELECT 1" | Extracts QUERY action or falls back to QUERY with raw text | edge |
| test_parse_whitespace_padded | Leading/trailing whitespace | " ANSWER 42 " | SQLAction(action_type="ANSWER", argument="42") | edge |
Run: uv run pytest tests/unit/test_rollout.py -v
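The parsing rules above (case-insensitivity, optional colon separator, extraction from multiline output, and the QUERY fallback) can be captured with a single regex. This is a sketch consistent with the table, not the project's actual parser; in particular, capturing everything after the keyword (including later lines) is an assumption.

```python
import re
from dataclasses import dataclass

@dataclass
class SQLAction:
    action_type: str
    argument: str

# Matches "QUERY ..." or "QUERY: ..." case-insensitively, anywhere in the
# text; DOTALL lets the argument span the remainder of a multiline output.
_ACTION_RE = re.compile(
    r"\b(DESCRIBE|SAMPLE|QUERY|ANSWER)\b\s*:?\s*(.*)",
    re.IGNORECASE | re.DOTALL,
)

def parse_model_output(text: str) -> SQLAction:
    m = _ACTION_RE.search(text)
    if m:
        return SQLAction(m.group(1).upper(), m.group(2).strip())
    # Fallback per the spec: unparseable output becomes a raw QUERY,
    # which the environment will then reject with an error observation.
    return SQLAction("QUERY", text.strip())
```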
1.5 reward_correctness (training/rewards.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_correctness_correct_answer | Episode ended with correct answer | Completions with correct=True metadata | [1.0] | happy |
| test_correctness_wrong_answer | Episode ended with wrong answer | Completions with correct=False metadata | [0.0] | happy |
| test_correctness_no_answer | Episode timed out without answering | Completions with no answer metadata | [0.0] | edge |
| test_correctness_batch | Multiple episodes in batch | Mixed correct/wrong | [1.0, 0.0, 1.0, 0.0] matching per-episode correctness | happy |
| test_correctness_empty_batch | Empty completions list | [] | [] | edge |
| test_correctness_trl_compatible | Return type is list[float] | Any valid input | all(isinstance(r, float) for r in result) | happy |
Run: uv run pytest tests/unit/test_rewards.py -v
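A sketch of a TRL-compatible `reward_correctness`, assuming each completion is a dict whose `correct` flag was set by `rollout_func`; a missing flag (episode timed out without answering) counts as wrong. The dict shape is an assumption based on the metadata keys listed in section 1.8.

```python
def reward_correctness(completions, **kwargs):
    """Return 1.0 per episode that ended with a correct ANSWER, else 0.0."""
    return [1.0 if c.get("correct") else 0.0 for c in completions]
```

An empty batch naturally maps to an empty list, and every element is a Python float, which covers the TRL-compatibility test above.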
1.6 reward_progress (training/rewards.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_progress_full | Maximum progress (correct answer) | Completions with full progress metadata | Reward in [0.0, 1.0], close to 1.0 | happy |
| test_progress_none | No progress toward answer | Completions with zero progress | [0.0] | happy |
| test_progress_partial | Partial progress | Completions with partial closeness | Reward in (0.0, 1.0) exclusive | happy |
| test_progress_normalized | Output is always in [0, 1] range | Various inputs | all(0.0 <= r <= 1.0 for r in result) | happy |
| test_progress_batch | Batch of varied progress | Multiple episodes | List of floats, length matches input | happy |
| test_progress_trl_compatible | Return type is list[float] | Any valid input | all(isinstance(r, float) for r in result) | happy |
Run: uv run pytest tests/unit/test_rewards.py -v
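`reward_progress` might look like the following, assuming each completion carries a pre-computed `progress` closeness score (the key name is an assumption from section 1.8); the clamp is what guarantees the normalization property tested above even for out-of-range inputs.

```python
def reward_progress(completions, **kwargs):
    """Clamp each episode's progress score into [0.0, 1.0]."""
    return [
        min(1.0, max(0.0, float(c.get("progress", 0.0))))
        for c in completions
    ]
```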
1.7 reward_operational (training/rewards.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_operational_good_episode | All steps execute OK, discover new info, no repeats | Completions with exec_ok=True, new_info=True per step | Positive reward | happy |
| test_operational_all_errors | Every step has execution errors | Completions with exec_ok=False per step | Low/negative reward | error |
| test_operational_repeat_penalty | Episode with repeated identical actions | Completions with repeat=True per step | Lower reward than non-repeating | happy |
| test_operational_mixed_signals | Mix of good and bad steps | Varied step signals | Reward between extremes | happy |
| test_operational_single_step | Episode with only one step | Single step completions | Valid float returned | edge |
| test_operational_batch | Multiple episodes | Batch input | List of floats, length matches | happy |
| test_operational_trl_compatible | Return type is list[float] | Any valid input | all(isinstance(r, float) for r in result) | happy |
Run: uv run pytest tests/unit/test_rewards.py -v
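A sketch of `reward_operational`, assuming per-step `exec_ok` / `new_info` / `repeat` booleans stored under a `steps` key (the key name and the weights are illustrative, not prescribed by the spec). Averaging over steps keeps episode length from dominating the score.

```python
def reward_operational(completions, **kwargs):
    """Score execution quality: reward clean, informative, non-repeating steps."""
    rewards = []
    for c in completions:
        steps = c.get("steps", [])
        if not steps:
            rewards.append(0.0)
            continue
        score = 0.0
        for s in steps:
            score += 0.5 if s.get("exec_ok") else -0.5   # execution errors hurt
            score += 0.3 if s.get("new_info") else 0.0   # discovering new info helps
            score -= 0.4 if s.get("repeat") else 0.0     # repeated actions are penalized
        rewards.append(score / len(steps))  # average, so length doesn't dominate
    return rewards
```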
1.8 rollout_func (training/rollout.py)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_rollout_returns_completions | Returns list of dicts with expected keys | Single prompt, mock model/tokenizer | List of dicts with "content" and metadata keys | happy |
| test_rollout_batch_size | Output length matches input prompt count | N prompts | N completions returned | happy |
| test_rollout_episode_terminates | Episodes terminate within step_budget | Config with step_budget=5 | All episodes have <= 5 steps | happy |
| test_rollout_metadata_present | Completions include correctness, progress, operational metadata | Any valid input | Each completion dict has "correct", "progress", "operational" keys | happy |
| test_rollout_unparseable_action | Model generates gibberish, fallback fires | Mock model returning garbage tokens | Episode continues; no crash | error |
| test_rollout_truncation | Long history is truncated to system + last 3 pairs | Mock model, config with step_budget=20 | Context does not exceed token window | edge |
Run: uv run pytest tests/unit/test_rollout.py -v
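The truncation rule in test_rollout_truncation (system prompt plus the last three observation-action pairs) reduces to a short helper. The chat-message layout assumed here is `[system, user, assistant, user, assistant, ...]`; the helper name is hypothetical.

```python
def truncate_history(messages, keep_pairs=3):
    """Keep the system message plus the last `keep_pairs` user/assistant pairs.

    messages: [system, user, assistant, user, assistant, ...]
    """
    system, rest = messages[:1], messages[1:]
    return system + rest[-2 * keep_pairs:]
```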
2. Integration Tests
Flow: End-to-End Training Episode
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Create GRPOConfig with test questions and mock DB | Config object created | Config fields match inputs |
| 2 | Load questions and filter by difficulty | Only easy+medium questions included | Assert filtered count < total if hard questions exist |
| 3 | Call rollout_func with a real SQLEnvironment and mock model | Completions returned with metadata | Each completion has "content" key |
| 4 | Pass completions to reward_correctness | Returns list[float] of 0.0/1.0 | Length matches batch size |
| 5 | Pass completions to reward_progress | Returns list[float] in [0,1] | Length matches batch size |
| 6 | Pass completions to reward_operational | Returns list[float] | Length matches batch size |
Run: uv run pytest tests/integration/test_training_pipeline.py -v
Flow: Unparseable Action Recovery
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Mock model generates unparseable text | parse_model_output returns QUERY fallback | action_type == "QUERY", argument == raw text |
| 2 | SQLEnvironment.step receives fallback action | Returns error observation | observation.error is non-empty |
| 3 | Episode continues with next step | Step count increments, budget decreases | step_count > previous, budget_remaining < previous |
Run: uv run pytest tests/integration/test_training_pipeline.py -v
Flow: History Truncation
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Run rollout with step_budget large enough to exceed token window | Truncation is triggered | History contains system prompt + last 3 observation-action pairs only |
| 2 | Episode completes normally after truncation | No crash; completions returned | Valid completion dicts in output |
Run: uv run pytest tests/integration/test_training_pipeline.py -v
3. API Tests
No API endpoints defined for F006. All interfaces are Python function calls.
4. E2E Tests
Scenario: Training Notebook Smoke Test
Setup: Test questions JSON with 2 easy questions, a test SQLite database, and a tiny model (or mock).
Actions:
- Instantiate GRPOConfig with test paths and minimal hyperparameters (1 epoch, batch_size=1, num_generations=2).
- Load model and tokenizer (use smallest available model or mock).
- Create GRPOTrainer with reward functions.
- Run trainer.train() for a single step.
- Verify learning curve data is logged.
- Run comparison episodes (before/after).
Expected:
- Training completes without error.
- At least one metric is logged (loss, reward).
- Comparison episodes produce valid SQLObservation sequences.
Run: uv run pytest tests/e2e/test_training_e2e.py -v --timeout=300
Scenario: Question Filtering by Difficulty
Setup: Questions file with easy, medium, and hard questions.
Actions:
- Create GRPOConfig with difficulty_filter=["easy"].
- Load and filter questions.
Expected: Only easy questions are included in training set.
Run: uv run pytest tests/e2e/test_training_e2e.py -v
5. Error Handling Tests
ModelLoadError
| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_model_load_error_bad_name | Invalid HuggingFace model name | GRPOConfig(model_name="nonexistent/model-xyz-999") | Fails fast; error message contains "nonexistent/model-xyz-999" |
ActionParseError (handled via fallback)
| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_action_parse_fallback_logged | Unparseable action triggers warning log | Model outputs "¯\_(ツ)_/¯" | Warning logged; returns QUERY fallback |
QuestionLoadError
| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_question_load_missing_file | Questions path does not exist | GRPOConfig(questions_path="/nonexistent/q.json") | Fails fast; error message contains the path |
| test_question_load_empty_file | Questions file is empty JSON array | questions.json containing [] | Fails fast; clear error about empty questions |
| test_question_load_invalid_json | Questions file has invalid JSON | questions.json containing {broken | Fails fast; JSON parse error |
OOMError
| Test | Description | Trigger | Expected |
|---|---|---|---|
| test_oom_guidance | OOM during training prints guidance | (Cannot reliably trigger in test; verify message formatting only) | Error handler message mentions reducing batch_size or num_generations |
Run: uv run pytest tests/unit/test_error_handling.py -v
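The QuestionLoadError tests imply fail-fast loading along these lines. The `load_questions` helper, its error types, and the `difficulty` key are assumptions for illustration; the spec only requires that each failure surfaces quickly with a clear message containing the offending path.

```python
import json
from pathlib import Path

def load_questions(questions_path, difficulty_filter):
    """Load and filter questions, failing fast on missing/empty/invalid files."""
    path = Path(questions_path)
    if not path.exists():
        # Message includes the path, per test_question_load_missing_file.
        raise FileNotFoundError(f"Questions file not found: {questions_path}")
    try:
        questions = json.loads(path.read_text())
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in {questions_path}: {e}") from e
    if not questions:
        raise ValueError(f"No questions found in {questions_path}")
    return [q for q in questions if q.get("difficulty") in difficulty_filter]
```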
6. Edge Cases Checklist
- Null/None inputs to parse_model_output
- Empty string inputs to parse_model_output
- Empty completions list to all reward functions
- Single-element completions list to all reward functions
- Very large batch (100+ prompts) to rollout_func
- Questions file with only hard questions and difficulty_filter=["easy"] (zero matches)
- step_budget=1 (immediate budget exhaustion after one action)
- step_budget=0 (zero budget)
- Unicode characters in model output (e.g., CJK, emoji)
- Model output exceeding max_new_tokens
- learning_rate=0.0 (no weight updates)
- num_generations=1 (minimum GRPO completions)
- Concurrent calls to reward functions (thread safety)
- Database with no tables (empty schema)
- Database with very large tables (performance)
7. Evidence Requirements
| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | X passed |
| Integration | pytest output | X passed |
| Error handling | pytest output | X passed |
| E2E | pytest output + training metrics | 1 passed, loss=X.XX |
| Reward functions | pytest output showing correct values | reward_correctness: [1.0, 0.0] |
| Parse fallback | pytest output + log capture | WARNING: unparseable action, falling back to QUERY |