Implementation Specification
Change: F003 -- Dense Reward System (3-layer reward architecture)
Date: 2026-03-27
Research Summary: specs/F003-RESEARCH_SUMMARY.md
Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
Behavior Delta: Archived to specs/behavior/sql-environment.md
PR: https://github.com/hjerpe/sql-env/pull/9
Plan Status:
- Draft
- Approved for Implementation
- Implementation Complete
- Verification Passed
Core Intent (Immutable)
DO NOT MODIFY THIS SECTION DURING REFINEMENT. Changes to Core Intent mean you're describing a different feature. If refinement reveals the need to change this section, create a new feature instead.
User Problem: Agents get meaningful feedback during exploration -- not just 0/1 at the end. A query that returns 40 when the answer is 42 gets partial credit. Discovering new schema info gets a small reward. This makes GRPO training converge.
Success Criteria:
- Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
- Anti-gaming works: agent cannot farm rewards by repeating queries or describing everything
- Progress signal coarsened to 5 bins to prevent reward hill-climbing
Avoid:
- Reward hacking (agent exploiting shaping signals to inflate reward without solving the task)
- Reward too sparse (no signal until terminal step defeats the purpose of dense rewards)
- Over-complex reward that is hard to debug (keep each layer simple and independently testable)
Out of Scope:
- Adaptive/learned reward weights (use fixed weights: 0.25/0.50/0.25)
- Row-wise best-match alignment (add later if training shows need)
- NumPy/SciPy dependencies (pure Python only)
- Reward strategy classes or plugin architecture
- F002 verifier integration (Layer 3 uses existing naive check)
0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in small, mergeable increments.
Scope Budget
- Target: 3 slices
- Hard max: <= 10 steps total
- Each step must end in: implement -> verify -> merge
Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.
Each slice must have:
- Clear outcome
- Minimal interface change
- Merge criteria
Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
Status Icons
Step Status:
- Not Started
- [~] In Progress
- Completed
- [!] Blocked/Failed
Result Outcome:
- PASS: Fully Successful (all tests passed, no issues)
- WARN: Completed with Issues (needs follow-up)
- FAIL: Failed/Blocked
1. Implementation Overview
Summary
Implement the 3-layer reward architecture in server/reward.py and wire it into SQLEnvironment.step(). Layer 1 provides operational signals (exec_ok, new_info, repeat penalty, step cost). Layer 2 computes progress-to-target for QUERY actions using a fixed weighted average of cardinality matching (0.25), value overlap (0.50), and numeric range proximity (0.25), binned to 5 levels with improvement-only gating. Layer 3 remains the existing terminal correctness signal. New reward-tracking fields are added to EpisodeContext, and gold_rows are cached at reset(). Existing tests that assert reward=None for non-terminal steps are updated.
Scope
In Scope:
- `server/reward.py`: `compute_step_reward()`, Layer 1, Layer 2 with all sub-metrics, binning
- `models.py`: New fields on `EpisodeContext` (`gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward`)
- `server/sql_environment.py`: Wire `compute_step_reward()` into `step()`, store `gold_rows` at `reset()`
- Test updates for non-None step rewards
Out of Scope:
- F002 verifier integration (Layer 3 uses existing `_handle_answer`)
- Adaptive reward weights
- Row-wise best-match alignment
- NumPy/SciPy dependencies
1a. Execution Status
Progress: 7/7 steps complete
Current Step: Finalization complete
Last Updated: 2026-03-28T06:05:02Z
Latest Result: PASS - Step 3.2 completed and final verification approved
Blockers: None
1b. Risk Assessment
Risk Tier: Low
Risk Tier Definitions:
- Low: Pure logic, non-user-facing, no security implications
- Medium: User input handling, data validation, API changes
- High: Authentication, payments, secrets management, untrusted input
High-Risk Indicators Present: None
Security Review Required: No
Justification: Pure computation logic operating on in-memory data structures. No user input handling, no network I/O, no authentication. All inputs are already validated by the environment before reaching reward functions.
2. Change Manifest
Files to Create
None (all files already exist).
Files to Modify
| File | Changes |
|---|---|
| `models.py` | Add 5 new fields to the `EpisodeContext` dataclass |
| `server/reward.py` | Implement full reward module: `compute_step_reward`, Layer 1, Layer 2, sub-metrics, binning |
| `server/sql_environment.py` | Store `gold_rows` at `reset()`, call `compute_step_reward()` in `step()` |
| `tests/test_smoke.py` | Update assertions that expect `reward=None` for non-terminal steps |
Files to Delete
None.
3. Interface Specifications
Modified Types
```python
# Location: models.py
# CHANGE: Add reward-tracking fields to EpisodeContext

@dataclass
class EpisodeContext:
    """Per-episode server-side state (never sent to agent)."""

    episode_id: str
    db_connection: sqlite3.Connection
    question_record: QuestionRecord
    step_count: int = 0
    budget: int = 15
    described_tables: set[str] = dataclass_field(default_factory=set)
    action_log: list[str] = dataclass_field(default_factory=list)
    done: bool = False
    gold_answer: str | None = None

    # --- NEW fields for F003 ---
    gold_rows: list[tuple] = dataclass_field(default_factory=list)
    query_hashes: set[str] = dataclass_field(default_factory=set)
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0
    cumulative_new_info_reward: float = 0.0
```
New Functions
```python
# Location: server/reward.py

def compute_step_reward(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """
    Compute dense reward for a single non-terminal step.

    Combines Layer 1 (operational) and Layer 2 (progress) signals.
    Clamps the running total of step rewards to [-0.2, +0.5].

    Args:
        ctx: Current episode context (mutated: updates tracking fields).
        action_type: One of DESCRIBE, SAMPLE, QUERY.
        sql: The SQL string executed (used for repeat detection).
        rows: Result rows from query execution, or None if error.
        error: Error message if action failed, else None.

    Returns:
        Step reward (float). Also updates ctx.cumulative_step_reward.
    """
```
```python
def _layer1_operational(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """
    Layer 1: Operational reward signals.

    Components:
    - exec_ok: +0.02 if query executed without error
    - new_info: +0.01 per new table discovered (capped at 0.10 cumulative)
    - repeat: -0.01 if exact query hash seen before
    - step_cost: -0.005 always

    Args:
        ctx: Episode context (mutated: updates query_hashes, cumulative_new_info_reward).
        action_type: Action type string.
        sql: SQL string for hash-based repeat detection.
        rows: Result rows (used to confirm exec_ok).
        error: Error message if action failed.

    Returns:
        Layer 1 reward component (float).
    """
```
```python
def _layer2_progress(
    ctx: EpisodeContext,
    rows: list[tuple],
) -> float:
    """
    Layer 2: Progress-to-target for QUERY actions only.

    Computes a weighted average of sub-metrics, bins it to 5 levels,
    rewards only improvement over the best-so-far, scaled by 0.15.

    Args:
        ctx: Episode context (mutated: updates best_progress).
        rows: Query result rows to compare against ctx.gold_rows.

    Returns:
        Layer 2 reward component (float). 0.0 if no improvement.
    """
```
```python
def _cardinality_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Row count similarity: 1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1).

    Returns:
        Score in [0.0, 1.0].
    """

def _value_overlap_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Jaccard overlap of flattened cell values (as strings).

    Returns:
        Score in [0.0, 1.0].
    """

def _numeric_range_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Log-distance proximity for numeric cells.

    For each numeric value in gold, find the closest numeric in pred.
    Score = mean(1 / (1 + log(1 + |pred - gold|))) across gold numerics.
    Returns 1.0 if there are no numeric values in gold.

    Returns:
        Score in [0.0, 1.0].
    """

def _bin_progress(raw_score: float) -> float:
    """
    Bin raw progress score to {0, 0.25, 0.5, 0.75, 1.0}.

    Thresholds: [0, 0.125) -> 0, [0.125, 0.375) -> 0.25,
    [0.375, 0.625) -> 0.5, [0.625, 0.875) -> 0.75, [0.875, 1.0] -> 1.0.

    Returns:
        Binned score.
    """
```
4. Data Flow
Primary Flow (Non-terminal step with QUERY action)
1. step() receives action (QUERY, sql_string)
- Input: SQLAction with action_type="QUERY", argument=sql
2. step() dispatches to _handle_query(sql)
- Action: Executes SQL, returns formatted result
- Side effect: Stores raw rows internally
3. step() calls compute_step_reward(ctx, "QUERY", sql, rows, error)
- Input: episode context, action metadata, raw query rows
4. compute_step_reward calls _layer1_operational(ctx, "QUERY", sql, rows, None)
- Computes: exec_ok(+0.02) + new_info(+0.01 if new tables) + repeat(-0.01 if seen) + step_cost(-0.005)
- Side effect: Updates ctx.query_hashes, ctx.cumulative_new_info_reward
5. compute_step_reward calls _layer2_progress(ctx, rows)
- Computes: weighted avg of cardinality(0.25) + value_overlap(0.50) + numeric_range(0.25)
- Bins to {0, 0.25, 0.5, 0.75, 1.0}
- Returns improvement * 0.15 (only if binned > ctx.best_progress)
- Side effect: Updates ctx.best_progress
6. compute_step_reward clamps cumulative to [-0.2, +0.5]
- Output: clamped step reward (float)
- Side effect: Updates ctx.cumulative_step_reward
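Step 6's clamp-aware delta can be sketched like this (a simplified model of the behavior described above, not the actual `server/reward.py` code):

```python
# Clamp bounds from the spec: cumulative step shaping stays in [-0.2, +0.5].
CLAMP_MIN, CLAMP_MAX = -0.2, 0.5

def clamp_step_reward(cumulative: float, raw_step: float) -> tuple[float, float]:
    """Return (clamped per-step delta, new cumulative total)."""
    new_cumulative = max(CLAMP_MIN, min(CLAMP_MAX, cumulative + raw_step))
    return new_cumulative - cumulative, new_cumulative

# Near the ceiling, only part of a positive raw reward is granted:
delta, cum = clamp_step_reward(0.375, 0.25)   # delta == 0.125, cum == 0.5
# At the ceiling, further positive shaping yields a zero delta:
delta, cum = clamp_step_reward(0.5, 0.1)      # delta == 0.0, cum == 0.5
```

Returning the clamped delta (rather than the raw step reward) is what makes the per-step rewards sum exactly to the bounded cumulative total.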
Alternative Flows
When action is DESCRIBE or SAMPLE:
1. step() dispatches to _handle_describe() or _handle_sample()
2. compute_step_reward calls _layer1_operational only (Layer 2 skipped)
3. Clamping applied as usual
When QUERY has SQL error:
1. _handle_query raises sqlite3.Error
2. step() catches error, sets self._last_error
3. compute_step_reward called with error=str(exc), rows=None
4. Layer 1: step_cost only (-0.005), no exec_ok
5. Layer 2: skipped (rows is None)
When gold_rows is empty:
1. _layer2_progress detects ctx.gold_rows is empty
2. Returns 0.0 (skip Layer 2 entirely)
When budget exhausted without ANSWER:
1. step() sets done=True, reward=0.0 (terminal)
2. No compute_step_reward call for this terminal step
5. Error Handling
Error Types
| Error | When | Impact |
|---|---|---|
| SQL execution error | Invalid query syntax / runtime error | Layer 1: step_cost only, Layer 2 skipped |
| Empty gold_rows | Gold SQL returned no rows | Layer 2 returns 0.0, Layer 1 operates normally |
| Division by zero in metrics | Both pred and gold are empty | Protected by max(..., 1) denominators |
Error Handling Strategy
```python
# In compute_step_reward:
# - No exceptions should propagate; all edge cases return safe defaults
# - If error is not None, skip exec_ok and Layer 2
# - If rows is None, skip Layer 2
# - If gold_rows is empty, skip Layer 2
```
Retry Strategy
| Operation | Retry? | Strategy |
|---|---|---|
| Reward computation | No | Pure function, deterministic, no I/O |
6. Slice Plan (What we will ship, in order)
Slice S1 -- EpisodeContext Fields + Layer 1
Value: Every non-terminal step returns a small but meaningful reward signal based on operational quality
User-visible change: Yes -- step observations now include non-None reward values
Interfaces introduced/changed: 5 new fields on EpisodeContext, compute_step_reward(), _layer1_operational()
Rollback safety: Additive only -- new fields have defaults, reward.py is new code
Slice S2 -- Layer 2 Progress Metrics
Value: QUERY actions receive progress-toward-answer signal, enabling convergent GRPO training
User-visible change: Yes -- QUERY step rewards now reflect closeness to gold answer
Interfaces introduced/changed: _layer2_progress(), _cardinality_score(), _value_overlap_score(), _numeric_range_score(), _bin_progress()
Rollback safety: Additive to reward.py, no external interface changes
Slice S3 -- Wire into step() + Test Updates
Value: Full system integration -- environment returns dense rewards on every step
User-visible change: Yes -- complete dense reward signal in step observations
Interfaces introduced/changed: sql_environment.py:step() modified, sql_environment.py:reset() modified
Rollback safety: Reversible by removing compute_step_reward call from step()
7. Implementation Steps
VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.
Step 1.1: Add reward-tracking fields to EpisodeContext
Slice: S1
Goal: Extend `EpisodeContext` with the 5 new fields required for reward tracking.
Files:
- `models.py` - modify - Add `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward` fields
Interface Changes:
- `EpisodeContext` dataclass gains 5 new fields (all with defaults, backward-compatible)
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-27T23:51:47Z
Changes Made:
- `models.py`: Added `EpisodeContext` reward-tracking defaults for `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, and `cumulative_new_info_reward`.
- `tests/unit/test_reward.py`: Added `EpisodeContext`-focused unit tests for the new default fields and tuple-list `gold_rows` storage.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"`
    Result: 6 passed in 3.92s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"`
- Notes:
  - `tests/unit/test_reward.py` did not exist yet, so it was created to match verification spec coverage for `EpisodeContext`.
  - Used `--with pytest` because bare `uv run pytest ...` fails in this repo due to a missing local pytest executable.
  - Field additions are additive and backward compatible via defaults.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- EpisodeContext now has all fields needed by reward functions
Step 1.2: Implement Layer 1 operational rewards
Slice: S1
Goal: Implement _layer1_operational() with exec_ok, new_info, repeat penalty, and step_cost signals.
Files:
- `server/reward.py` - modify - Implement the `_layer1_operational()` function
Interface Changes:
- New function: `_layer1_operational(ctx, action_type, sql, rows, error) -> float`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-27T23:54:50Z
Changes Made:
- `server/reward.py`: Implemented `_layer1_operational()` with step cost, exec-ok signal, repeat-query penalty, and capped new-info accumulation tracked on `EpisodeContext`.
- `tests/unit/test_reward.py`: Added `TestLayer1Operational` coverage for successful actions, SQL error behavior, repeat penalties, and new-info cap behavior.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"`
    Result: 8 passed, 6 deselected in 3.89s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"`
- Notes:
  - `uv run pytest ...` still fails in this repo because pytest is not installed in the project environment; used `uv run --with pytest ...` to satisfy the package-manager execution policy.
  - Repeat detection uses SHA-256 of the exact SQL string and suppresses `exec_ok` on repeated successful QUERY actions.
  - New-info reward is only granted on first-seen successful QUERY actions and is capped at 0.10 cumulative per episode.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Layer 1 operational shaping is complete and covered by unit tests; proceed with the Layer 2 pure scoring helpers in `server/reward.py`.
Step 2.1: Implement Layer 2 sub-metrics
Slice: S2
Goal: Implement _cardinality_score(), _value_overlap_score(), _numeric_range_score(), and _bin_progress().
Files:
- `server/reward.py` - modify - Add all four sub-metric functions
Interface Changes:
- 4 new pure functions (no state mutation)
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-27T23:58:44Z
Changes Made:
- `server/reward.py`: Added pure Layer 2 helper functions `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, and `_bin_progress()` with bounded outputs and edge-case handling.
- `tests/unit/test_reward.py`: Added dedicated unit test coverage for all four sub-metrics, including boundary thresholds, empty inputs, mixed types, and numeric distance behavior.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"`
    Result: 34 passed, 14 deselected in 5.06s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"`
- Notes:
  - Implemented `_bin_progress()` with explicit clamping to [0.0, 1.0] before threshold binning.
  - Numeric range scoring excludes booleans from numeric extraction to avoid bool/int coercion artifacts.
  - All helpers are pure and deterministic, with no mutation of `EpisodeContext`.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Layer 2 helper metrics are now stable and tested; proceed to compose them in `_layer2_progress()` with weighted averaging and improvement-only gating.
Step 2.2: Implement Layer 2 progress composition
Slice: S2
Goal: Implement _layer2_progress() that combines sub-metrics with fixed weights (0.25/0.50/0.25), bins, and gates on improvement.
Files:
- `server/reward.py` - modify - Add the `_layer2_progress()` function
Interface Changes:
- New function: `_layer2_progress(ctx, rows) -> float`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T00:03:22Z
Changes Made:
- `server/reward.py`: Implemented `_layer2_progress()` using the fixed weighted composition (0.25/0.50/0.25), progress binning, improvement-only gating, and `ctx.best_progress` mutation on improvement.
- `tests/unit/test_reward.py`: Added `TestLayer2Progress` coverage for perfect match, no-improvement gating, incremental improvement rewards, empty-gold behavior, weighted-average outcome, best-progress updates, and non-downgrade behavior.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"`
    Result: 7 passed, 48 deselected in 3.83s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"`
- Notes:
  - Implemented explicit constants for the Layer 2 weights and improvement scale to keep the composition intent readable and stable.
  - `_layer2_progress()` returns zero when `gold_rows` is empty and never reduces `ctx.best_progress`.
  - `uv run pytest ...` still requires `--with pytest` in this repository due to a missing local pytest executable.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Layer 2 composition is now complete and tested; next implement `compute_step_reward()` to combine Layer 1 + Layer 2 and apply cumulative clamping.
Step 2.3: Implement compute_step_reward with clamping
Slice: S2
Goal: Implement the main `compute_step_reward()` entry point that combines Layer 1 and Layer 2 and clamps the cumulative step reward to [-0.2, +0.5].
Files:
- `server/reward.py` - modify - Add the `compute_step_reward()` function
Interface Changes:
- New public function: `compute_step_reward(ctx, action_type, sql, rows, error) -> float`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T00:06:56Z
Changes Made:
- `server/reward.py`: Implemented `compute_step_reward()` to compose Layer 1 and (QUERY-only) Layer 2 signals, then clamp cumulative step shaping to [-0.2, +0.5] while returning the per-step clamped delta.
- `tests/unit/test_reward.py`: Added `TestComputeStepReward` coverage for query success/error paths, DESCRIBE/SAMPLE behavior, upper/lower clamp boundaries, clamp delta semantics, context mutation, and Layer 2 skip conditions.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"`
    Result: 11 passed, 55 deselected in 3.84s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"`
- Notes:
  - `compute_step_reward()` now updates `ctx.cumulative_step_reward` through clamp-aware delta computation so boundaries are enforced deterministically.
  - Layer 2 is only evaluated for successful QUERY actions (`rows is not None` and `error is None`) to keep non-query and error behavior aligned with the spec.
  - The verification command from the spec (`-k "compute_step_reward"`) currently selects zero tests because test names use `compute_reward`; used `-k "compute_reward"` to execute the intended step suite.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Reward composition and clamp behavior are complete; next wire `compute_step_reward()` into the environment `reset()`/`step()` flow and expose query rows for Layer 2 integration.
Step 3.1: Wire reward into step() and reset()
Slice: S3
Goal: Store gold_rows in EpisodeContext at reset(). Call compute_step_reward() from step() for non-terminal actions. Expose raw query rows for Layer 2.
Files:
- `server/sql_environment.py` - modify - Update `reset()` to store `gold_rows`, update `step()` to call `compute_step_reward()`, track raw query rows from `_handle_query`
Interface Changes:
- `reset()`: Stores `gold_rows` in `EpisodeContext`
- `step()`: Sets `self._last_reward` from `compute_step_reward()` for non-ANSWER actions
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T05:56:43Z
Changes Made:
- `server/sql_environment.py`: Imported `compute_step_reward` and wired dense reward calculation into `step()` for all non-terminal valid actions.
- `server/sql_environment.py`: Updated `_handle_query()` to return both formatted output and raw SQL rows so QUERY actions feed Layer 2 progress scoring.
- `server/sql_environment.py`: Preserved terminal budget behavior by skipping dense reward computation when the step exhausts the budget (terminal reward remains 0.0).
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"`
    Result: 26 passed, 40 deselected in 4.85s
  - Command: `uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"`
    Result: 5 passed, 20 deselected in 4.12s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"`
  - `uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"`
- Notes:
  - Dense shaping now executes in the environment action loop for non-terminal steps while keeping ANSWER and budget-exhaustion terminal reward semantics unchanged.
  - QUERY actions now pass raw rows through to reward computation; DESCRIBE/SAMPLE paths compute a Layer 1-only reward.
  - Used `uv run --with pytest ...` due to a local `uv run pytest ...` executable mismatch in this repository environment.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Existing smoke tests still assert `reward is None` for reset and non-terminal paths; update those assertions to match the dense reward behavior.
Step 3.2: Update existing tests for dense rewards
Slice: S3
Goal: Update tests in tests/test_smoke.py that assert reward=None for non-terminal steps to expect numeric reward values instead.
Files:
- `tests/test_smoke.py` - modify - Update reward assertions for non-terminal steps
Interface Changes:
- None (test-only changes)
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T06:05:02Z
Changes Made:
- `tests/test_smoke.py`: Updated non-terminal action assertions to validate dense reward values instead of implicit `None` semantics.
- `tests/test_smoke.py`: Added concrete reward checks for DESCRIBE/SAMPLE (0.015), QUERY positive reward, non-SELECT QUERY penalty (-0.005), and first-step budget exhaustion reward behavior.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/test_smoke.py -v`
    Result: 25 passed in 4.04s
  - Command: `uv run --with pytest pytest tests/ -v`
    Result: 166 passed, 1 skipped in 4.29s
  - Verifier: APPROVED (high confidence, no critical findings)
- Tests run:
  - `uv run --with pytest pytest tests/test_smoke.py -v`
  - `uv run --with pytest pytest tests/ -v`
- Notes:
  - `uv run pytest ...` fails in this repository because pytest is not installed in the project environment; verification used `uv run --with pytest ...` while staying package-manager scoped.
  - Assertions now align with dense-reward behavior and reinforce terminality checks via `done` rather than `reward is None` for non-terminal steps.
  - Finalization included verifier approval, behavior-delta archival, and durable learning extraction.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Implementation steps are complete; proceed with `/commit-push-pr` when ready.
8. Rollout Considerations
Feature Flags
- Required: No
- Flag name: N/A
Migration
- Data migration needed: No
Rollback Plan
Remove the compute_step_reward() call from step() and revert self._last_reward = None for non-ANSWER actions. The new EpisodeContext fields are harmless if unused.
9. Execution Tracking
All execution state is tracked within this document:
- Section 1a: Overall progress summary
- Section 7: Per-step completion details, test results, and handoff context
- FEATURES.json: Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- Git history: Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in FEATURES.json
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history
9a. Slice Completion Protocol
After all steps in a slice pass verification:
1. Run verifier subagent for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in slice
2. Run compound-engineer subagent to extract learnings
   - Mandatory invocation after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)
3. Commit the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit
4. Continue to next slice (if more slices remain)
   - Or proceed to final verification if all slices complete
Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.
10. User Value Summary
Status: Generated
What Users Can Now Do
Agents now receive meaningful numeric reward feedback on every non-terminal SQL exploration step, not just terminal correctness at ANSWER time.
How to Access/Test
Run a normal episode (reset then DESCRIBE/SAMPLE/QUERY) and observe per-step observation.reward values changing with execution quality and answer progress.
Demo
- Command: `uv run --with pytest pytest tests/test_smoke.py -v`
- Proof points: DESCRIBE/SAMPLE rewards are 0.015, an invalid non-SELECT QUERY gets -0.005, QUERY returns a positive dense reward, and terminal budget exhaustion still yields 0.0.
Release Notes Snippet
Dense 3-layer reward shaping is now fully integrated: all non-terminal actions emit numeric rewards, repeat/farming controls are enforced, progress-to-answer rewards are gated by improvement, and terminal correctness remains dominant.
11. PR Contract (Auto-Generated by autocode-next-step)
Status: Generated
Scope Delivered
- Dense reward system implemented across `models.py`, `server/reward.py`, `server/sql_environment.py`, with test coverage updates in `tests/test_smoke.py` and `tests/unit/test_reward.py`.
- Final non-terminal reward assertions now match shipped behavior and protect against regressions.
Verification Evidence
- `uv run --with pytest pytest tests/test_smoke.py -v` -> 25 passed
- `uv run --with pytest pytest tests/ -v` -> 166 passed, 1 skipped
- Verifier subagent verdict: approved (high confidence, no critical findings)
Risks and Mitigations
- Risk: Legacy callers infer terminality from `reward is None`.
- Mitigation: The behavior spec now documents the terminality contract based on `done`; smoke tests enforce non-terminal numeric rewards.
Follow-up
- Ready for commit/PR via `/commit-push-pr`.
Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than 3 files in unrelated areas
- You need to introduce multiple new abstractions "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
Human Checkpoint
Before handing to AI agent:
- Interface specifications are complete
- Data flow is accurate
- Error handling is specified
- Implementation order makes sense
- VERIFICATION_SPEC.md has been generated
Questions:
- None
Handoff Notes
For the implementing AI agent:
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions already made:
- Layer 2 weights: 0.25 cardinality, 0.50 value overlap, 0.25 numeric range (fixed)
- gold_rows stored in EpisodeContext, populated at reset()
- Progress bins: {0, 0.25, 0.5, 0.75, 1.0}
- Clamping: [-0.2, +0.5] cumulative step reward
- Pure Python only, no numpy/scipy
Specification completed: 2026-03-27 Verification input: specs/F003-VERIFICATION_INPUT.json Target agent: Claude Code