
Implementation Specification

Change: F003 -- Dense Reward System (3-layer reward architecture)
Date: 2026-03-27
Research Summary: specs/F003-RESEARCH_SUMMARY.md
Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
Behavior Delta: Archived to specs/behavior/sql-environment.md
PR: https://github.com/hjerpe/sql-env/pull/9

Plan Status:

  • Draft
  • Approved for Implementation
  • Implementation Complete
  • Verification Passed

Core Intent (Immutable)

DO NOT MODIFY THIS SECTION DURING REFINEMENT. Changes to Core Intent mean you're describing a different feature. If refinement reveals the need to change this section, create a new feature instead.

User Problem: Agents get meaningful feedback during exploration -- not just 0/1 at the end. A query that returns 40 when the answer is 42 gets partial credit. Discovering new schema info gets a small reward. This makes GRPO training converge.

Success Criteria:

  • Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
  • Anti-gaming works: agent cannot farm rewards by repeating queries or describing everything
  • Progress signal coarsened to 5 bins to prevent reward hill-climbing

Avoid:

  • Reward hacking (agent exploiting shaping signals to inflate reward without solving the task)
  • Reward too sparse (no signal until terminal step defeats the purpose of dense rewards)
  • Over-complex reward that is hard to debug (keep each layer simple and independently testable)

Out of Scope:

  • Adaptive/learned reward weights (use fixed weights: 0.25/0.50/0.25)
  • Row-wise best-match alignment (add later if training shows need)
  • NumPy/SciPy dependencies (pure Python only)
  • Reward strategy classes or plugin architecture
  • F002 verifier integration (Layer 3 uses existing naive check)

0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in small, mergeable increments.

Scope Budget

  • Target: 3 slices
  • Hard max: <= 10 steps total
  • Each step must end in: implement -> verify -> merge

Slice Definition

A slice is a vertical increment that delivers user-visible value or a safe internal capability.

Each slice must have:

  • Clear outcome
  • Minimal interface change
  • Merge criteria

Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).

Status Icons

Step Status:

  • [ ] Not Started
  • [~] In Progress
  • [x] Completed
  • [!] Blocked/Failed

Result Outcome:

  • PASS: Fully Successful (all tests passed, no issues)
  • WARN: Completed with Issues (needs follow-up)
  • FAIL: Failed/Blocked

1. Implementation Overview

Summary

Implement the 3-layer reward architecture in server/reward.py and wire it into SQLEnvironment.step(). Layer 1 provides operational signals (exec_ok, new_info, repeat penalty, step cost). Layer 2 computes progress-to-target for QUERY actions using a fixed weighted average of cardinality matching (0.25), value overlap (0.50), and numeric range proximity (0.25), binned to 5 levels with improvement-only gating. Layer 3 remains the existing terminal correctness signal. New reward-tracking fields are added to EpisodeContext, and gold_rows are cached at reset(). Existing tests that assert reward=None for non-terminal steps are updated.
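The clamp-aware composition described above can be sketched as follows; the constants come from this spec, while the function name and return convention are illustrative rather than the shipped server/reward.py API:

```python
def sketch_step_reward(layer1: float, layer2: float, cumulative: float,
                       lo: float = -0.2, hi: float = 0.5) -> tuple[float, float]:
    """Combine Layer 1 and Layer 2, clamping the running total to [lo, hi].

    Returns (new_cumulative, reward_actually_emitted). Emitting the clamp-aware
    delta rather than the raw sum guarantees the episode's summed step shaping
    never leaves [-0.2, +0.5], no matter how many shaped steps occur.
    """
    new_total = min(max(cumulative + layer1 + layer2, lo), hi)
    return new_total, new_total - cumulative
```

For example, with the running total already at 0.45, a raw 0.2 step only emits 0.05, because the ceiling of 0.5 binds.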

Scope

In Scope:

  • server/reward.py: compute_step_reward(), Layer 1, Layer 2 with all sub-metrics, binning
  • models.py: New fields on EpisodeContext (gold_rows, query_hashes, best_progress, cumulative_step_reward, cumulative_new_info_reward)
  • server/sql_environment.py: Wire compute_step_reward() into step(), store gold_rows at reset()
  • Test updates for non-None step rewards

Out of Scope:

  • F002 verifier integration (Layer 3 uses existing _handle_answer)
  • Adaptive reward weights
  • Row-wise best-match alignment
  • NumPy/SciPy dependencies

1a. Execution Status

Progress: 7/7 steps complete
Current Step: Finalization complete
Last Updated: 2026-03-28T06:05:02Z
Latest Result: PASS - Step 3.2 completed and final verification approved
Blockers: None


1b. Risk Assessment

Risk Tier: Low

Risk Tier Definitions:

  • Low: Pure logic, non-user-facing, no security implications
  • Medium: User input handling, data validation, API changes
  • High: Authentication, payments, secrets management, untrusted input

High-Risk Indicators Present: None

Security Review Required: No

Justification: Pure computation logic operating on in-memory data structures. No user input handling, no network I/O, no authentication. All inputs are already validated by the environment before reaching reward functions.


2. Change Manifest

Files to Create

None (all files already exist).

Files to Modify

| File | Changes |
| --- | --- |
| models.py | Add 5 new fields to EpisodeContext dataclass |
| server/reward.py | Implement full reward module: compute_step_reward, Layer 1, Layer 2, sub-metrics, binning |
| server/sql_environment.py | Store gold_rows at reset(), call compute_step_reward() in step() |
| tests/test_smoke.py | Update assertions that expect reward=None for non-terminal steps |

Files to Delete

None.


3. Interface Specifications

Modified Types

# Location: models.py
# CHANGE: Add reward-tracking fields to EpisodeContext

@dataclass
class EpisodeContext:
    """Per-episode server-side state (never sent to agent)."""

    episode_id: str
    db_connection: sqlite3.Connection
    question_record: QuestionRecord
    step_count: int = 0
    budget: int = 15
    described_tables: set[str] = dataclass_field(default_factory=set)
    action_log: list[str] = dataclass_field(default_factory=list)
    done: bool = False
    gold_answer: str | None = None
    # --- NEW fields for F003 ---
    gold_rows: list[tuple] = dataclass_field(default_factory=list)
    query_hashes: set[str] = dataclass_field(default_factory=set)
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0
    cumulative_new_info_reward: float = 0.0

New Functions

# Location: server/reward.py

def compute_step_reward(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """
    Compute dense reward for a single non-terminal step.

    Combines Layer 1 (operational) and Layer 2 (progress) signals.
    Clamps running total of step rewards to [-0.2, +0.5].

    Args:
        ctx: Current episode context (mutated: updates tracking fields).
        action_type: One of DESCRIBE, SAMPLE, QUERY.
        sql: The SQL string executed (used for repeat detection).
        rows: Result rows from query execution, or None if error.
        error: Error message if action failed, else None.

    Returns:
        Step reward (float). Also updates ctx.cumulative_step_reward.
    """


def _layer1_operational(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """
    Layer 1: Operational reward signals.

    Components:
        - exec_ok: +0.02 if query executed without error
        - new_info: +0.01 per new table discovered (capped at 0.10 cumulative)
        - repeat: -0.01 if exact query hash seen before
        - step_cost: -0.005 always

    Args:
        ctx: Episode context (mutated: updates query_hashes, cumulative_new_info_reward).
        action_type: Action type string.
        sql: SQL string for hash-based repeat detection.
        rows: Result rows (used to confirm exec_ok).
        error: Error message if action failed.

    Returns:
        Layer 1 reward component (float).
    """


def _layer2_progress(
    ctx: EpisodeContext,
    rows: list[tuple],
) -> float:
    """
    Layer 2: Progress-to-target for QUERY actions only.

    Computes weighted average of sub-metrics, bins to 5 levels,
    rewards only improvement over best-so-far, scaled by 0.15.

    Args:
        ctx: Episode context (mutated: updates best_progress).
        rows: Query result rows to compare against ctx.gold_rows.

    Returns:
        Layer 2 reward component (float). 0.0 if no improvement.
    """


def _cardinality_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Row count similarity: 1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1).

    Returns:
        Score in [0.0, 1.0].
    """


def _value_overlap_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Jaccard overlap of flattened cell values (as strings).

    Returns:
        Score in [0.0, 1.0].
    """


def _numeric_range_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Log-distance proximity for numeric cells.

    For each numeric value in gold, find closest numeric in pred.
    Score = mean(1 / (1 + log(1 + |pred - gold|))) across gold numerics.
    Returns 1.0 if no numeric values in gold.

    Returns:
        Score in [0.0, 1.0].
    """


def _bin_progress(raw_score: float) -> float:
    """
    Bin raw progress score to {0, 0.25, 0.5, 0.75, 1.0}.

    Thresholds: [0, 0.125) -> 0, [0.125, 0.375) -> 0.25,
    [0.375, 0.625) -> 0.5, [0.625, 0.875) -> 0.75, [0.875, 1.0] -> 1.0.

    Returns:
        Binned score.
    """

4. Data Flow

Primary Flow (Non-terminal step with QUERY action)

1. step() receives action (QUERY, sql_string)
   - Input: SQLAction with action_type="QUERY", argument=sql

2. step() dispatches to _handle_query(sql)
   - Action: Executes SQL, returns formatted result
   - Side effect: Stores raw rows internally

3. step() calls compute_step_reward(ctx, "QUERY", sql, rows, error)
   - Input: episode context, action metadata, raw query rows

4. compute_step_reward calls _layer1_operational(ctx, "QUERY", sql, rows, None)
   - Computes: exec_ok(+0.02) + new_info(+0.01 if new tables) + repeat(-0.01 if seen) + step_cost(-0.005)
   - Side effect: Updates ctx.query_hashes, ctx.cumulative_new_info_reward

5. compute_step_reward calls _layer2_progress(ctx, rows)
   - Computes: weighted avg of cardinality(0.25) + value_overlap(0.50) + numeric_range(0.25)
   - Bins to {0, 0.25, 0.5, 0.75, 1.0}
   - Returns improvement * 0.15 (only if binned > ctx.best_progress)
   - Side effect: Updates ctx.best_progress

6. compute_step_reward clamps cumulative to [-0.2, +0.5]
   - Output: clamped step reward (float)
   - Side effect: Updates ctx.cumulative_step_reward
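As a worked instance of steps 4-6, consider a hypothetical first successful QUERY that discovers one new table and reaches binned progress 0.5 from a prior best of 0.0 (all constants from this spec):

```python
layer1 = 0.02 + 0.01 - 0.005      # exec_ok + new_info (one new table) + step_cost
layer2 = (0.5 - 0.0) * 0.15       # improvement over best_progress, scaled by 0.15
step_reward = layer1 + layer2     # well inside the [-0.2, +0.5] cumulative clamp
print(round(step_reward, 3))      # 0.1
```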

Alternative Flows

When action is DESCRIBE or SAMPLE:

1. step() dispatches to _handle_describe() or _handle_sample()
2. compute_step_reward calls _layer1_operational only (Layer 2 skipped)
3. Clamping applied as usual

When QUERY has SQL error:

1. _handle_query raises sqlite3.Error
2. step() catches error, sets self._last_error
3. compute_step_reward called with error=str(exc), rows=None
4. Layer 1: step_cost only (-0.005), no exec_ok
5. Layer 2: skipped (rows is None)

When gold_rows is empty:

1. _layer2_progress detects ctx.gold_rows is empty
2. Returns 0.0 (skip Layer 2 entirely)

When budget exhausted without ANSWER:

1. step() sets done=True, reward=0.0 (terminal)
2. No compute_step_reward call for this terminal step

5. Error Handling

Error Types

| Error | When | Impact |
| --- | --- | --- |
| SQL execution error | Invalid query syntax / runtime error | Layer 1: step_cost only, Layer 2 skipped |
| Empty gold_rows | Gold SQL returned no rows | Layer 2 returns 0.0, Layer 1 operates normally |
| Division by zero in metrics | Both pred and gold are empty | Protected by max(..., 1) denominators |

Error Handling Strategy

# In compute_step_reward:
# - No exceptions should propagate; all edge cases return safe defaults
# - If error is not None, skip exec_ok and Layer 2
# - If rows is None, skip Layer 2
# - If gold_rows is empty, skip Layer 2

Retry Strategy

| Operation | Retry? | Strategy |
| --- | --- | --- |
| Reward computation | No | Pure function, deterministic, no I/O |

6. Slice Plan (What we will ship, in order)

Slice S1 -- EpisodeContext Fields + Layer 1

Value: Every non-terminal step returns a small but meaningful reward signal based on operational quality
User-visible change: Yes -- step observations now include non-None reward values
Interfaces introduced/changed: 5 new fields on EpisodeContext, compute_step_reward(), _layer1_operational()
Rollback safety: Additive only -- new fields have defaults, reward.py is new code

Slice S2 -- Layer 2 Progress Metrics

Value: QUERY actions receive progress-toward-answer signal, enabling convergent GRPO training
User-visible change: Yes -- QUERY step rewards now reflect closeness to gold answer
Interfaces introduced/changed: _layer2_progress(), _cardinality_score(), _value_overlap_score(), _numeric_range_score(), _bin_progress()
Rollback safety: Additive to reward.py, no external interface changes

Slice S3 -- Wire into step() + Test Updates

Value: Full system integration -- environment returns dense rewards on every step
User-visible change: Yes -- complete dense reward signal in step observations
Interfaces introduced/changed: sql_environment.py:step() modified, sql_environment.py:reset() modified
Rollback safety: Reversible by removing compute_step_reward call from step()


7. Implementation Steps

VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.

Step 1.1: Add reward-tracking fields to EpisodeContext

Slice: S1
Goal: Extend EpisodeContext with the 5 new fields required for reward tracking.

Files:

  • models.py - modify - Add gold_rows, query_hashes, best_progress, cumulative_step_reward, cumulative_new_info_reward fields

Interface Changes:

  • EpisodeContext dataclass gains 5 new fields (all with defaults, backward-compatible)

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-27T23:51:47Z
Changes Made:

  • models.py: Added EpisodeContext reward-tracking defaults for gold_rows, query_hashes, best_progress, cumulative_step_reward, and cumulative_new_info_reward.
  • tests/unit/test_reward.py: Added EpisodeContext-focused unit tests for new default fields and tuple-list gold_rows storage.

Result:

  • Outcome: PASS
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"
    Result: 6 passed in 3.92s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"
  • Notes:
    • tests/unit/test_reward.py did not exist yet, so it was created to match verification spec coverage for EpisodeContext.
  • Used --with pytest because bare uv run pytest ... fails in this repo due to a missing local pytest executable.
    • Field additions are additive and backward compatible via defaults.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • EpisodeContext now has all fields needed by reward functions

Step 1.2: Implement Layer 1 operational rewards

Slice: S1
Goal: Implement _layer1_operational() with exec_ok, new_info, repeat penalty, and step_cost signals.

Files:

  • server/reward.py - modify - Implement _layer1_operational() function

Interface Changes:

  • New function _layer1_operational(ctx, action_type, sql, rows, error) -> float

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-27T23:54:50Z
Changes Made:

  • server/reward.py: Implemented _layer1_operational() with step cost, exec-ok signal, repeat-query penalty, and capped new-info accumulation tracked on EpisodeContext.
  • tests/unit/test_reward.py: Added TestLayer1Operational coverage for successful actions, SQL error behavior, repeat penalties, and new-info cap behavior.

Result:

  • Outcome: PASS
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"
    Result: 8 passed, 6 deselected in 3.89s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"
  • Notes:
    • uv run pytest ... still fails in this repo because pytest is not installed in the project environment; used uv run --with pytest ... to satisfy package-manager execution policy.
    • Repeat detection uses SHA-256 of the exact SQL string and suppresses exec_ok on repeated successful QUERY actions.
    • New-info reward is only granted on first-seen successful QUERY actions and is capped at 0.10 cumulative per episode.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Layer 1 operational shaping is complete and covered by unit tests; proceed with Layer 2 pure scoring helpers in server/reward.py.

Step 2.1: Implement Layer 2 sub-metrics

Slice: S2
Goal: Implement _cardinality_score(), _value_overlap_score(), _numeric_range_score(), and _bin_progress().

Files:

  • server/reward.py - modify - Add all four sub-metric functions

Interface Changes:

  • 4 new pure functions (no state mutation)

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-27T23:58:44Z
Changes Made:

  • server/reward.py: Added pure Layer 2 helper functions _cardinality_score(), _value_overlap_score(), _numeric_range_score(), and _bin_progress() with bounded outputs and edge-case handling.
  • tests/unit/test_reward.py: Added dedicated unit test coverage for all four sub-metrics, including boundary thresholds, empty inputs, mixed types, and numeric distance behavior.

Result:

  • Outcome: PASS
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"
    Result: 34 passed, 14 deselected in 5.06s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"
  • Notes:
    • Implemented _bin_progress() with explicit clamping to [0.0, 1.0] before threshold binning.
    • Numeric range scoring excludes booleans from numeric extraction to avoid bool/int coercion artifacts.
    • All helpers are pure and deterministic, with no mutation of EpisodeContext.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Layer 2 helper metrics are now stable and tested; proceed to compose them in _layer2_progress() with weighted averaging and improvement-only gating.

Step 2.2: Implement Layer 2 progress composition

Slice: S2
Goal: Implement _layer2_progress() that combines sub-metrics with fixed weights (0.25/0.50/0.25), bins, and gates on improvement.

Files:

  • server/reward.py - modify - Add _layer2_progress() function

Interface Changes:

  • New function _layer2_progress(ctx, rows) -> float

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-28T00:03:22Z
Changes Made:

  • server/reward.py: Implemented _layer2_progress() using the fixed weighted composition (0.25/0.50/0.25), progress binning, improvement-only gating, and ctx.best_progress mutation on improvement.
  • tests/unit/test_reward.py: Added TestLayer2Progress coverage for perfect match, no-improvement gating, incremental improvement rewards, empty-gold behavior, weighted-average outcome, best-progress updates, and non-downgrade behavior.

Result:

  • Outcome: PASS
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"
    Result: 7 passed, 48 deselected in 3.83s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"
  • Notes:
    • Implemented explicit constants for Layer 2 weights and improvement scale to keep composition intent readable and stable.
    • _layer2_progress() returns zero when gold_rows is empty and never reduces ctx.best_progress.
  • uv run pytest ... still requires --with pytest in this repository due to the missing local pytest executable.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Layer 2 composition is now complete and tested; next implement compute_step_reward() to combine Layer 1 + Layer 2 and apply cumulative clamping.

Step 2.3: Implement compute_step_reward with clamping

Slice: S2
Goal: Implement the main compute_step_reward() entry point that combines Layer 1 and Layer 2 and applies clamping to [-0.2, +0.5].

Files:

  • server/reward.py - modify - Add compute_step_reward() function

Interface Changes:

  • New public function compute_step_reward(ctx, action_type, sql, rows, error) -> float

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-28T00:06:56Z
Changes Made:

  • server/reward.py: Implemented compute_step_reward() to compose Layer 1 and (QUERY-only) Layer 2 signals, then clamp cumulative step shaping to [-0.2, +0.5] while returning the per-step clamped delta.
  • tests/unit/test_reward.py: Added TestComputeStepReward coverage for query success/error paths, DESCRIBE/SAMPLE behavior, upper/lower clamp boundaries, clamp delta semantics, context mutation, and Layer 2 skip conditions.

Result:

  • Outcome: PASS
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"
    Result: 11 passed, 55 deselected in 3.84s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"
  • Notes:
    • compute_step_reward() now updates ctx.cumulative_step_reward through clamp-aware delta computation so boundaries are enforced deterministically.
    • Layer 2 is only evaluated for successful QUERY actions (rows is not None and error is None) to keep non-query and error behavior aligned with spec.
    • Verification command from spec (-k "compute_step_reward") currently selects zero tests because test names use compute_reward; used -k "compute_reward" to execute the intended step suite.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Reward composition and clamp behavior are complete; next wire compute_step_reward() into environment reset()/step() flow and expose query rows for Layer 2 integration.

Step 3.1: Wire reward into step() and reset()

Slice: S3
Goal: Store gold_rows in EpisodeContext at reset(). Call compute_step_reward() from step() for non-terminal actions. Expose raw query rows for Layer 2.

Files:

  • server/sql_environment.py - modify - Update reset() to store gold_rows, update step() to call compute_step_reward, track raw query rows from _handle_query

Interface Changes:

  • reset(): Stores gold_rows in EpisodeContext
  • step(): Sets self._last_reward from compute_step_reward() for non-ANSWER actions

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-28T05:56:43Z
Changes Made:

  • server/sql_environment.py: Imported compute_step_reward and wired dense reward calculation into step() for all non-terminal valid actions.
  • server/sql_environment.py: Updated _handle_query() to return both formatted output and raw SQL rows so QUERY actions feed Layer 2 progress scoring.
  • server/sql_environment.py: Preserved terminal budget behavior by skipping dense reward computation when the step exhausts budget (terminal reward remains 0.0).

Result:

  • Outcome: PASS
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"
    Result: 26 passed, 40 deselected in 4.85s
    
    Command: uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"
    Result: 5 passed, 20 deselected in 4.12s
    
  • Tests run:
    • uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"
    • uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"
  • Notes:
    • Dense shaping now executes in the environment action loop for non-terminal steps while keeping ANSWER and budget-exhaustion terminal reward semantics unchanged.
    • QUERY actions now pass raw rows through to reward computation; DESCRIBE/SAMPLE paths compute Layer 1-only reward.
  • Used uv run --with pytest ... due to the local uv run pytest ... executable mismatch in this repository environment.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Existing smoke tests still assert reward is None for reset and non-terminal paths; update those assertions to match dense reward behavior.

Step 3.2: Update existing tests for dense rewards

Slice: S3
Goal: Update tests in tests/test_smoke.py that assert reward=None for non-terminal steps to expect numeric reward values instead.

Files:

  • tests/test_smoke.py - modify - Update reward assertions for non-terminal steps

Interface Changes:

  • None (test-only changes)

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: Completed

Completed: 2026-03-28T06:05:02Z
Changes Made:

  • tests/test_smoke.py: Updated non-terminal action assertions to validate dense reward values instead of implicit None semantics.
  • tests/test_smoke.py: Added concrete reward checks for DESCRIBE/SAMPLE (0.015), QUERY positive reward, non-SELECT QUERY penalty (-0.005), and first-step budget exhaustion reward behavior.

Result:

  • Outcome: PASS
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/test_smoke.py -v
    Result: 25 passed in 4.04s
    
    Command: uv run --with pytest pytest tests/ -v
    Result: 166 passed, 1 skipped in 4.29s
    
    Verifier: APPROVED (high confidence, no critical findings)
    
  • Tests run:
    • uv run --with pytest pytest tests/test_smoke.py -v
    • uv run --with pytest pytest tests/ -v
  • Notes:
    • uv run pytest ... fails in this repository because pytest is not installed in the project environment; verification used uv run --with pytest ... while staying package-manager scoped.
    • Assertions now align with dense-reward behavior and reinforce terminality checks via done rather than reward is None for non-terminal steps.
    • Finalization included verifier approval, behavior-delta archival, and durable learning extraction.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Implementation steps are complete; proceed with /commit-push-pr when ready.

8. Rollout Considerations

Feature Flags

  • Required: No
  • Flag name: N/A

Migration

  • Data migration needed: No

Rollback Plan

Remove the compute_step_reward() call from step() and revert self._last_reward = None for non-ANSWER actions. The new EpisodeContext fields are harmless if unused.


9. Execution Tracking

All execution state is tracked within this document:

  • Section 1a: Overall progress summary
  • Section 7: Per-step completion details, test results, and handoff context
  • FEATURES.json: Feature-level status/progress metadata used by /autocode-next-step and opencode-ctx ralph run
  • Git history: Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:

  • Checking Section 1a for summary
  • Reviewing Section 7 for detailed step status
  • Inspecting the feature's progress and status fields in FEATURES.json
  • Running git log --oneline IMPLEMENTATION_SPEC.md for change history

9a. Slice Completion Protocol

After all steps in a slice pass verification:

  1. Run verifier subagent for spec compliance

    • Validates against VERIFICATION_SPEC.md criteria
    • Ensures no TODOs or incomplete work in slice
  2. Run compound-engineer subagent to extract learnings

    • Mandatory invocation after every slice completion
    • Updates CLAUDE.md Learnings section (if durable patterns found)
    • May exit with "no update needed" (valid for routine work)
  3. Commit the slice changes

    • Follow commit message format in CLAUDE.md
    • Each slice gets its own atomic commit
  4. Continue to next slice (if more slices remain)

    • Or proceed to final verification if all slices complete

Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.


10. User Value Summary

Status: Generated

What Users Can Now Do

Agents now receive meaningful numeric reward feedback on every non-terminal SQL exploration step, not just terminal correctness at ANSWER time.

How to Access/Test

Run a normal episode (reset then DESCRIBE/SAMPLE/QUERY) and observe per-step observation.reward values changing with execution quality and answer progress.

Demo

  • Command: uv run --with pytest pytest tests/test_smoke.py -v
  • Proof points: DESCRIBE/SAMPLE rewards are 0.015, invalid non-SELECT QUERY gets -0.005, QUERY returns positive dense reward, terminal budget-exhaustion still yields 0.0.
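The proof-point numbers decompose into the Layer 1 constants from Section 3; the following is sanity arithmetic over those constants, not an exercise of the shipped code:

```python
describe_or_sample = 0.02 - 0.005   # exec_ok + step_cost
invalid_query = -0.005              # step_cost only; no exec_ok on error
print(round(describe_or_sample, 3), invalid_query)
```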

Release Notes Snippet

Dense 3-layer reward shaping is now fully integrated: all non-terminal actions emit numeric rewards, repeat/farming controls are enforced, progress-to-answer rewards are gated by improvement, and terminal correctness remains dominant.


11. PR Contract (Auto-Generated by autocode-next-step)

Status: Generated

Scope Delivered

  • Dense reward system implemented across models.py, server/reward.py, server/sql_environment.py, and test coverage updates in tests/test_smoke.py and tests/unit/test_reward.py.
  • Final non-terminal reward assertions now match shipped behavior and protect against regressions.

Verification Evidence

  • uv run --with pytest pytest tests/test_smoke.py -v -> 25 passed
  • uv run --with pytest pytest tests/ -v -> 166 passed, 1 skipped
  • Verifier subagent verdict: approved (high confidence, no critical findings)

Risks and Mitigations

  • Risk: Legacy callers infer terminality from reward is None.
  • Mitigation: Behavior spec now documents terminality contract based on done; smoke tests enforce non-terminal numeric rewards.

Follow-up

  • Ready for commit/PR via /commit-push-pr.

Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:

  • A step requires touching more than 3 files in unrelated areas
  • You need to introduce multiple new abstractions "just in case"
  • Verification cannot be made targeted and concrete
  • You discover new unknowns that change the plan materially
  • The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.


Human Checkpoint

Before handing to AI agent:

  • Interface specifications are complete
  • Data flow is accurate
  • Error handling is specified
  • Implementation order makes sense
  • VERIFICATION_SPEC.md has been generated

Questions:

  1. None

Handoff Notes

For the implementing AI agent:

Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions already made:
  - Layer 2 weights: 0.25 cardinality, 0.50 value overlap, 0.25 numeric range (fixed)
  - gold_rows stored in EpisodeContext, populated at reset()
  - Progress bins: {0, 0.25, 0.5, 0.75, 1.0}
  - Clamping: [-0.2, +0.5] cumulative step reward
  - Pure Python only, no numpy/scipy

Specification completed: 2026-03-27
Verification input: specs/F003-VERIFICATION_INPUT.json
Target agent: Claude Code