Implementation Specification
Change: F003 -- Dense Reward System (3-layer reward architecture)
Date: 2026-03-27
Research Summary: specs/F003-RESEARCH_SUMMARY.md
Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
Behavior Delta: Archived to specs/behavior/sql-environment.md
PR: https://github.com/hjerpe/sql-env/pull/9
Plan Status:
- Draft
- Approved for Implementation
- Implementation Complete
- Verification Passed
Core Intent (Immutable)
DO NOT MODIFY THIS SECTION DURING REFINEMENT. Changes to Core Intent mean you're describing a different feature. If refinement reveals the need to change this section, create a new feature instead.
User Problem: Agents get meaningful feedback during exploration -- not just 0/1 at the end. A query that returns 40 when the answer is 42 gets partial credit. Discovering new schema info gets a small reward. This makes GRPO training converge.
Success Criteria:
- Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
- Anti-gaming works: agent cannot farm rewards by repeating queries or describing everything
- Progress signal coarsened to 5 bins to prevent reward hill-climbing
Avoid:
- Reward hacking (agent exploiting shaping signals to inflate reward without solving the task)
- Reward too sparse (no signal until terminal step defeats the purpose of dense rewards)
- Over-complex reward that is hard to debug (keep each layer simple and independently testable)
Out of Scope:
- Adaptive/learned reward weights (use fixed weights: 0.25/0.50/0.25)
- Row-wise best-match alignment (add later if training shows need)
- NumPy/SciPy dependencies (pure Python only)
- Reward strategy classes or plugin architecture
- F002 verifier integration (Layer 3 uses existing naive check)
0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in small, mergeable increments.
Scope Budget
- Target: 3 slices
- Hard max: <= 10 steps total
- Each step must end in: implement -> verify -> merge
Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.
Each slice must have:
- Clear outcome
- Minimal interface change
- Merge criteria
Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
Status Icons
Step Status:
- Not Started
- [~] In Progress
- Completed
- [!] Blocked/Failed
Result Outcome:
- PASS: Fully Successful (all tests passed, no issues)
- WARN: Completed with Issues (needs follow-up)
- FAIL: Failed/Blocked
1. Implementation Overview
Summary
Implement the 3-layer reward architecture in server/reward.py and wire it into SQLEnvironment.step(). Layer 1 provides operational signals (exec_ok, new_info, repeat penalty, step cost). Layer 2 computes progress-to-target for QUERY actions using a fixed weighted average of cardinality matching (0.25), value overlap (0.50), and numeric range proximity (0.25), binned to 5 levels with improvement-only gating. Layer 3 remains the existing terminal correctness signal. New reward-tracking fields are added to EpisodeContext, and gold_rows are cached at reset(). Existing tests that assert reward=None for non-terminal steps are updated.
Scope
In Scope:
- `server/reward.py`: `compute_step_reward()`, Layer 1, Layer 2 with all sub-metrics, binning
- `models.py`: New fields on `EpisodeContext` (`gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward`)
- `server/sql_environment.py`: Wire `compute_step_reward()` into `step()`, store `gold_rows` at `reset()`
- Test updates for non-None step rewards
Out of Scope:
- F002 verifier integration (Layer 3 uses existing `_handle_answer`)
- Adaptive reward weights
- Row-wise best-match alignment
- NumPy/SciPy dependencies
1a. Execution Status
Progress: 7/7 steps complete
Current Step: Finalization complete
Last Updated: 2026-03-28T06:05:02Z
Latest Result: PASS - Step 3.2 completed and final verification approved
Blockers: None
1b. Risk Assessment
Risk Tier: Low
Risk Tier Definitions:
- Low: Pure logic, non-user-facing, no security implications
- Medium: User input handling, data validation, API changes
- High: Authentication, payments, secrets management, untrusted input
High-Risk Indicators Present: None
Security Review Required: No
Justification: Pure computation logic operating on in-memory data structures. No user input handling, no network I/O, no authentication. All inputs are already validated by the environment before reaching reward functions.
2. Change Manifest
Files to Create
None (all files already exist).
Files to Modify
| File | Changes |
|---|---|
| `models.py` | Add 5 new fields to the `EpisodeContext` dataclass |
| `server/reward.py` | Implement full reward module: `compute_step_reward`, Layer 1, Layer 2, sub-metrics, binning |
| `server/sql_environment.py` | Store `gold_rows` at `reset()`, call `compute_step_reward()` in `step()` |
| `tests/test_smoke.py` | Update assertions that expect `reward=None` for non-terminal steps |
Files to Delete
None.
3. Interface Specifications
Modified Types
```python
# Location: models.py
# CHANGE: Add reward-tracking fields to EpisodeContext

@dataclass
class EpisodeContext:
    """Per-episode server-side state (never sent to agent)."""

    episode_id: str
    db_connection: sqlite3.Connection
    question_record: QuestionRecord
    step_count: int = 0
    budget: int = 15
    described_tables: set[str] = dataclass_field(default_factory=set)
    action_log: list[str] = dataclass_field(default_factory=list)
    done: bool = False
    gold_answer: str | None = None

    # --- NEW fields for F003 ---
    gold_rows: list[tuple] = dataclass_field(default_factory=list)
    query_hashes: set[str] = dataclass_field(default_factory=set)
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0
    cumulative_new_info_reward: float = 0.0
```
New Functions
```python
# Location: server/reward.py

def compute_step_reward(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """
    Compute dense reward for a single non-terminal step.

    Combines Layer 1 (operational) and Layer 2 (progress) signals.
    Clamps the running total of step rewards to [-0.2, +0.5].

    Args:
        ctx: Current episode context (mutated: updates tracking fields).
        action_type: One of DESCRIBE, SAMPLE, QUERY.
        sql: The SQL string executed (used for repeat detection).
        rows: Result rows from query execution, or None if error.
        error: Error message if action failed, else None.

    Returns:
        Step reward (float). Also updates ctx.cumulative_step_reward.
    """
```
```python
def _layer1_operational(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """
    Layer 1: Operational reward signals.

    Components:
    - exec_ok: +0.02 if query executed without error
    - new_info: +0.01 per new table discovered (capped at 0.10 cumulative)
    - repeat: -0.01 if exact query hash seen before
    - step_cost: -0.005 always

    Args:
        ctx: Episode context (mutated: updates query_hashes, cumulative_new_info_reward).
        action_type: Action type string.
        sql: SQL string for hash-based repeat detection.
        rows: Result rows (used to confirm exec_ok).
        error: Error message if action failed.

    Returns:
        Layer 1 reward component (float).
    """
```
```python
def _layer2_progress(
    ctx: EpisodeContext,
    rows: list[tuple],
) -> float:
    """
    Layer 2: Progress-to-target for QUERY actions only.

    Computes a weighted average of sub-metrics, bins it to 5 levels,
    rewards only improvement over the best-so-far, scaled by 0.15.

    Args:
        ctx: Episode context (mutated: updates best_progress).
        rows: Query result rows to compare against ctx.gold_rows.

    Returns:
        Layer 2 reward component (float). 0.0 if no improvement.
    """
```
```python
def _cardinality_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Row count similarity: 1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1).

    Returns:
        Score in [0.0, 1.0].
    """

def _value_overlap_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Jaccard overlap of flattened cell values (as strings).

    Returns:
        Score in [0.0, 1.0].
    """

def _numeric_range_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Log-distance proximity for numeric cells.

    For each numeric value in gold, find the closest numeric in pred.
    Score = mean(1 / (1 + log(1 + |pred - gold|))) across gold numerics.
    Returns 1.0 if there are no numeric values in gold.

    Returns:
        Score in [0.0, 1.0].
    """

def _bin_progress(raw_score: float) -> float:
    """
    Bin raw progress score to {0, 0.25, 0.5, 0.75, 1.0}.

    Thresholds: [0, 0.125) -> 0, [0.125, 0.375) -> 0.25,
    [0.375, 0.625) -> 0.5, [0.625, 0.875) -> 0.75, [0.875, 1.0] -> 1.0.

    Returns:
        Binned score.
    """
```
4. Data Flow
Primary Flow (Non-terminal step with QUERY action)
1. step() receives action (QUERY, sql_string)
- Input: SQLAction with action_type="QUERY", argument=sql
2. step() dispatches to _handle_query(sql)
- Action: Executes SQL, returns formatted result
- Side effect: Stores raw rows internally
3. step() calls compute_step_reward(ctx, "QUERY", sql, rows, error)
- Input: episode context, action metadata, raw query rows
4. compute_step_reward calls _layer1_operational(ctx, "QUERY", sql, rows, None)
- Computes: exec_ok(+0.02) + new_info(+0.01 if new tables) + repeat(-0.01 if seen) + step_cost(-0.005)
- Side effect: Updates ctx.query_hashes, ctx.cumulative_new_info_reward
5. compute_step_reward calls _layer2_progress(ctx, rows)
- Computes: weighted avg of cardinality(0.25) + value_overlap(0.50) + numeric_range(0.25)
- Bins to {0, 0.25, 0.5, 0.75, 1.0}
- Returns improvement * 0.15 (only if binned > ctx.best_progress)
- Side effect: Updates ctx.best_progress
6. compute_step_reward clamps cumulative to [-0.2, +0.5]
- Output: clamped step reward (float)
- Side effect: Updates ctx.cumulative_step_reward
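Step 6's clamp-aware delta can be sketched like this (a simplified model of the behavior described above, not the actual `server/reward.py` code):

```python
# Clamp bounds from the spec: cumulative step shaping stays in [-0.2, +0.5].
CLAMP_MIN, CLAMP_MAX = -0.2, 0.5

def clamp_step_reward(cumulative: float, raw_step: float) -> tuple[float, float]:
    """Return (clamped per-step delta, new cumulative total)."""
    new_cumulative = max(CLAMP_MIN, min(CLAMP_MAX, cumulative + raw_step))
    return new_cumulative - cumulative, new_cumulative

# Near the ceiling, only part of a positive raw reward is granted:
delta, cum = clamp_step_reward(0.375, 0.25)   # delta == 0.125, cum == 0.5
# At the ceiling, further positive shaping yields a zero delta:
delta, cum = clamp_step_reward(0.5, 0.1)      # delta == 0.0, cum == 0.5
```

Returning the clamped delta (rather than the raw step reward) is what makes the per-step rewards sum exactly to the bounded cumulative total.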
Alternative Flows
When action is DESCRIBE or SAMPLE:
1. step() dispatches to _handle_describe() or _handle_sample()
2. compute_step_reward calls _layer1_operational only (Layer 2 skipped)
3. Clamping applied as usual
When QUERY has SQL error:
1. _handle_query raises sqlite3.Error
2. step() catches error, sets self._last_error
3. compute_step_reward called with error=str(exc), rows=None
4. Layer 1: step_cost only (-0.005), no exec_ok
5. Layer 2: skipped (rows is None)
When gold_rows is empty:
1. _layer2_progress detects ctx.gold_rows is empty
2. Returns 0.0 (skip Layer 2 entirely)
When budget exhausted without ANSWER:
1. step() sets done=True, reward=0.0 (terminal)
2. No compute_step_reward call for this terminal step
5. Error Handling
Error Types
| Error | When | Impact |
|---|---|---|
| SQL execution error | Invalid query syntax / runtime error | Layer 1: step_cost only, Layer 2 skipped |
| Empty gold_rows | Gold SQL returned no rows | Layer 2 returns 0.0, Layer 1 operates normally |
| Division by zero in metrics | Both pred and gold are empty | Protected by max(..., 1) denominators |
Error Handling Strategy
```python
# In compute_step_reward:
# - No exceptions should propagate; all edge cases return safe defaults
# - If error is not None, skip exec_ok and Layer 2
# - If rows is None, skip Layer 2
# - If gold_rows is empty, skip Layer 2
```
Retry Strategy
| Operation | Retry? | Strategy |
|---|---|---|
| Reward computation | No | Pure function, deterministic, no I/O |
6. Slice Plan (What we will ship, in order)
Slice S1 -- EpisodeContext Fields + Layer 1
Value: Every non-terminal step returns a small but meaningful reward signal based on operational quality
User-visible change: Yes -- step observations now include non-None reward values
Interfaces introduced/changed: 5 new fields on EpisodeContext, compute_step_reward(), _layer1_operational()
Rollback safety: Additive only -- new fields have defaults, reward.py is new code
Slice S2 -- Layer 2 Progress Metrics
Value: QUERY actions receive progress-toward-answer signal, enabling convergent GRPO training
User-visible change: Yes -- QUERY step rewards now reflect closeness to gold answer
Interfaces introduced/changed: _layer2_progress(), _cardinality_score(), _value_overlap_score(), _numeric_range_score(), _bin_progress()
Rollback safety: Additive to reward.py, no external interface changes
Slice S3 -- Wire into step() + Test Updates
Value: Full system integration -- environment returns dense rewards on every step
User-visible change: Yes -- complete dense reward signal in step observations
Interfaces introduced/changed: sql_environment.py:step() modified, sql_environment.py:reset() modified
Rollback safety: Reversible by removing compute_step_reward call from step()
7. Implementation Steps
VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.
Step 1.1: Add reward-tracking fields to EpisodeContext
Slice: S1
Goal: Extend `EpisodeContext` with the 5 new fields required for reward tracking.
Files:
- `models.py` - modify - Add `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward` fields
Interface Changes:
- `EpisodeContext` dataclass gains 5 new fields (all with defaults, backward-compatible)
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-27T23:51:47Z
Changes Made:
- `models.py`: Added `EpisodeContext` reward-tracking defaults for `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, and `cumulative_new_info_reward`.
- `tests/unit/test_reward.py`: Added `EpisodeContext`-focused unit tests for the new default fields and tuple-list `gold_rows` storage.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"`
    Result: 6 passed in 3.92s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"`
- Notes:
  - `tests/unit/test_reward.py` did not exist yet, so it was created to match verification spec coverage for `EpisodeContext`.
  - Used `--with pytest` because bare `uv run pytest ...` fails in this repo due to a missing local pytest executable.
  - Field additions are additive and backward compatible via defaults.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- EpisodeContext now has all fields needed by reward functions
Step 1.2: Implement Layer 1 operational rewards
Slice: S1
Goal: Implement _layer1_operational() with exec_ok, new_info, repeat penalty, and step_cost signals.
Files:
- `server/reward.py` - modify - Implement the `_layer1_operational()` function
Interface Changes:
- New function: `_layer1_operational(ctx, action_type, sql, rows, error) -> float`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-27T23:54:50Z
Changes Made:
- `server/reward.py`: Implemented `_layer1_operational()` with step cost, exec-ok signal, repeat-query penalty, and capped new-info accumulation tracked on `EpisodeContext`.
- `tests/unit/test_reward.py`: Added `TestLayer1Operational` coverage for successful actions, SQL error behavior, repeat penalties, and new-info cap behavior.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"`
    Result: 8 passed, 6 deselected in 3.89s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"`
- Notes:
  - `uv run pytest ...` still fails in this repo because pytest is not installed in the project environment; used `uv run --with pytest ...` to satisfy the package-manager execution policy.
  - Repeat detection uses SHA-256 of the exact SQL string and suppresses `exec_ok` on repeated successful QUERY actions.
  - New-info reward is only granted on first-seen successful QUERY actions and is capped at 0.10 cumulative per episode.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Layer 1 operational shaping is complete and covered by unit tests; proceed with the Layer 2 pure scoring helpers in `server/reward.py`.
Step 2.1: Implement Layer 2 sub-metrics
Slice: S2
Goal: Implement _cardinality_score(), _value_overlap_score(), _numeric_range_score(), and _bin_progress().
Files:
- `server/reward.py` - modify - Add all four sub-metric functions
Interface Changes:
- 4 new pure functions (no state mutation)
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-27T23:58:44Z
Changes Made:
- `server/reward.py`: Added pure Layer 2 helper functions `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, and `_bin_progress()` with bounded outputs and edge-case handling.
- `tests/unit/test_reward.py`: Added dedicated unit test coverage for all four sub-metrics, including boundary thresholds, empty inputs, mixed types, and numeric distance behavior.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"`
    Result: 34 passed, 14 deselected in 5.06s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"`
- Notes:
  - Implemented `_bin_progress()` with explicit clamping to [0.0, 1.0] before threshold binning.
  - Numeric range scoring excludes booleans from numeric extraction to avoid bool/int coercion artifacts.
  - All helpers are pure and deterministic, with no mutation of `EpisodeContext`.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Layer 2 helper metrics are now stable and tested; proceed to compose them in `_layer2_progress()` with weighted averaging and improvement-only gating.
Step 2.2: Implement Layer 2 progress composition
Slice: S2
Goal: Implement _layer2_progress() that combines sub-metrics with fixed weights (0.25/0.50/0.25), bins, and gates on improvement.
Files:
- `server/reward.py` - modify - Add the `_layer2_progress()` function
Interface Changes:
- New function: `_layer2_progress(ctx, rows) -> float`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T00:03:22Z
Changes Made:
- `server/reward.py`: Implemented `_layer2_progress()` using the fixed weighted composition (0.25/0.50/0.25), progress binning, improvement-only gating, and `ctx.best_progress` mutation on improvement.
- `tests/unit/test_reward.py`: Added `TestLayer2Progress` coverage for perfect match, no-improvement gating, incremental improvement rewards, empty-gold behavior, weighted-average outcome, best-progress updates, and non-downgrade behavior.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"`
    Result: 7 passed, 48 deselected in 3.83s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"`
- Notes:
  - Implemented explicit constants for the Layer 2 weights and improvement scale to keep the composition intent readable and stable.
  - `_layer2_progress()` returns zero when `gold_rows` is empty and never reduces `ctx.best_progress`.
  - `uv run pytest ...` still requires `--with pytest` in this repository due to a missing local pytest executable.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Layer 2 composition is now complete and tested; next implement `compute_step_reward()` to combine Layer 1 + Layer 2 and apply cumulative clamping.
Step 2.3: Implement compute_step_reward with clamping
Slice: S2
Goal: Implement the main `compute_step_reward()` entry point that combines Layer 1 and Layer 2 and clamps the cumulative step reward to [-0.2, +0.5].
Files:
- `server/reward.py` - modify - Add the `compute_step_reward()` function
Interface Changes:
- New public function: `compute_step_reward(ctx, action_type, sql, rows, error) -> float`
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T00:06:56Z
Changes Made:
- `server/reward.py`: Implemented `compute_step_reward()` to compose Layer 1 and (QUERY-only) Layer 2 signals, then clamp cumulative step shaping to [-0.2, +0.5] while returning the per-step clamped delta.
- `tests/unit/test_reward.py`: Added `TestComputeStepReward` coverage for query success/error paths, DESCRIBE/SAMPLE behavior, upper/lower clamp boundaries, clamp delta semantics, context mutation, and Layer 2 skip conditions.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"`
    Result: 11 passed, 55 deselected in 3.84s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"`
- Notes:
  - `compute_step_reward()` now updates `ctx.cumulative_step_reward` through clamp-aware delta computation so boundaries are enforced deterministically.
  - Layer 2 is only evaluated for successful QUERY actions (`rows is not None` and `error is None`) to keep non-query and error behavior aligned with the spec.
  - The verification command from the spec (`-k "compute_step_reward"`) currently selects zero tests because test names use `compute_reward`; used `-k "compute_reward"` to execute the intended step suite.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Reward composition and clamp behavior are complete; next wire `compute_step_reward()` into the environment `reset()`/`step()` flow and expose query rows for Layer 2 integration.
Step 3.1: Wire reward into step() and reset()
Slice: S3
Goal: Store gold_rows in EpisodeContext at reset(). Call compute_step_reward() from step() for non-terminal actions. Expose raw query rows for Layer 2.
Files:
- `server/sql_environment.py` - modify - Update `reset()` to store `gold_rows`, update `step()` to call `compute_step_reward()`, track raw query rows from `_handle_query`
Interface Changes:
- `reset()`: Stores `gold_rows` in `EpisodeContext`
- `step()`: Sets `self._last_reward` from `compute_step_reward()` for non-ANSWER actions
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T05:56:43Z
Changes Made:
- `server/sql_environment.py`: Imported `compute_step_reward` and wired dense reward calculation into `step()` for all non-terminal valid actions.
- `server/sql_environment.py`: Updated `_handle_query()` to return both formatted output and raw SQL rows so QUERY actions feed Layer 2 progress scoring.
- `server/sql_environment.py`: Preserved terminal budget behavior by skipping dense reward computation when the step exhausts the budget (terminal reward remains 0.0).
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"`
    Result: 26 passed, 40 deselected in 4.85s
  - Command: `uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"`
    Result: 5 passed, 20 deselected in 4.12s
- Tests run:
  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"`
  - `uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"`
- Notes:
  - Dense shaping now executes in the environment action loop for non-terminal steps while keeping ANSWER and budget-exhaustion terminal reward semantics unchanged.
  - QUERY actions now pass raw rows through to reward computation; DESCRIBE/SAMPLE paths compute a Layer 1-only reward.
  - Used `uv run --with pytest ...` due to a local `uv run pytest ...` executable mismatch in this repository environment.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Existing smoke tests still assert `reward is None` for reset and non-terminal paths; update those assertions to match the dense reward behavior.
Step 3.2: Update existing tests for dense rewards
Slice: S3
Goal: Update tests in tests/test_smoke.py that assert reward=None for non-terminal steps to expect numeric reward values instead.
Files:
- `tests/test_smoke.py` - modify - Update reward assertions for non-terminal steps
Interface Changes:
- None (test-only changes)
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: Completed
Completed: 2026-03-28T06:05:02Z
Changes Made:
- `tests/test_smoke.py`: Updated non-terminal action assertions to validate dense reward values instead of implicit `None` semantics.
- `tests/test_smoke.py`: Added concrete reward checks for DESCRIBE/SAMPLE (0.015), QUERY positive reward, non-SELECT QUERY penalty (-0.005), and first-step budget exhaustion reward behavior.
Result:
- Outcome: PASS
- Evidence Captured:
  - Command: `uv run --with pytest pytest tests/test_smoke.py -v`
    Result: 25 passed in 4.04s
  - Command: `uv run --with pytest pytest tests/ -v`
    Result: 166 passed, 1 skipped in 4.29s
  - Verifier: APPROVED (high confidence, no critical findings)
- Tests run:
  - `uv run --with pytest pytest tests/test_smoke.py -v`
  - `uv run --with pytest pytest tests/ -v`
- Notes:
  - `uv run pytest ...` fails in this repository because pytest is not installed in the project environment; verification used `uv run --with pytest ...` while staying package-manager scoped.
  - Assertions now align with dense-reward behavior and reinforce terminality checks via `done` rather than `reward is None` for non-terminal steps.
  - Finalization included verifier approval, behavior-delta archival, and durable learning extraction.
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Implementation steps are complete; proceed with `/commit-push-pr` when ready.
8. Rollout Considerations
Feature Flags
- Required: No
- Flag name: N/A
Migration
- Data migration needed: No
Rollback Plan
Remove the compute_step_reward() call from step() and revert self._last_reward = None for non-ANSWER actions. The new EpisodeContext fields are harmless if unused.
9. Execution Tracking
All execution state is tracked within this document:
- Section 1a: Overall progress summary
- Section 7: Per-step completion details, test results, and handoff context
- FEATURES.json: Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- Git history: Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in FEATURES.json
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history
9a. Slice Completion Protocol
After all steps in a slice pass verification:
1. Run verifier subagent for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in slice
2. Run compound-engineer subagent to extract learnings
   - Mandatory invocation after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)
3. Commit the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit
4. Continue to next slice (if more slices remain)
   - Or proceed to final verification if all slices complete
Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.
10. User Value Summary
Status: Generated
What Users Can Now Do
Agents now receive meaningful numeric reward feedback on every non-terminal SQL exploration step, not just terminal correctness at ANSWER time.
How to Access/Test
Run a normal episode (reset then DESCRIBE/SAMPLE/QUERY) and observe per-step observation.reward values changing with execution quality and answer progress.
Demo
- Command: `uv run --with pytest pytest tests/test_smoke.py -v`
- Proof points: DESCRIBE/SAMPLE rewards are 0.015, an invalid non-SELECT QUERY gets -0.005, QUERY returns a positive dense reward, and terminal budget exhaustion still yields 0.0.
Release Notes Snippet
Dense 3-layer reward shaping is now fully integrated: all non-terminal actions emit numeric rewards, repeat/farming controls are enforced, progress-to-answer rewards are gated by improvement, and terminal correctness remains dominant.
11. PR Contract (Auto-Generated by autocode-next-step)
Status: Generated
Scope Delivered
- Dense reward system implemented across `models.py`, `server/reward.py`, `server/sql_environment.py`, with test coverage updates in `tests/test_smoke.py` and `tests/unit/test_reward.py`.
- Final non-terminal reward assertions now match shipped behavior and protect against regressions.
Verification Evidence
- `uv run --with pytest pytest tests/test_smoke.py -v` -> 25 passed
- `uv run --with pytest pytest tests/ -v` -> 166 passed, 1 skipped
- Verifier subagent verdict: approved (high confidence, no critical findings)
Risks and Mitigations
- Risk: Legacy callers infer terminality from `reward is None`.
- Mitigation: The behavior spec now documents the terminality contract based on `done`; smoke tests enforce non-terminal numeric rewards.
Follow-up
- Ready for commit/PR via `/commit-push-pr`.
Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than 3 files in unrelated areas
- You need to introduce multiple new abstractions "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
Human Checkpoint
Before handing to AI agent:
- Interface specifications are complete
- Data flow is accurate
- Error handling is specified
- Implementation order makes sense
- VERIFICATION_SPEC.md has been generated
Questions:
- None
Handoff Notes
For the implementing AI agent:
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions already made:
- Layer 2 weights: 0.25 cardinality, 0.50 value overlap, 0.25 numeric range (fixed)
- gold_rows stored in EpisodeContext, populated at reset()
- Progress bins: {0, 0.25, 0.5, 0.75, 1.0}
- Clamping: [-0.2, +0.5] cumulative step reward
- Pure Python only, no numpy/scipy
Specification completed: 2026-03-27 Verification input: specs/F003-VERIFICATION_INPUT.json Target agent: Claude Code