Feature Demo: F003 — Dense Reward System
Generated: 2026-03-28T06:07:34Z
Context source: spec + discovery only (implementation not read)
Feature entry: FEATURES.json #F003
What This Feature Does
Before this feature, agents only got a binary reward at the end of an episode, which made exploration hard to learn from. With F003, agents now get small, meaningful reward signals during non-terminal DESCRIBE/SAMPLE/QUERY steps, plus the final terminal correctness reward.
From the user perspective, this means random exploration should produce low cumulative reward, targeted exploration should produce higher reward, and anti-gaming controls should prevent farming rewards via repeated or low-value behavior.
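As a rough illustration of this reward shape, here is a minimal sketch of a dense step-reward rule with clamping and a repeat-query guard. All names and values below are hypothetical; the actual F003 implementation was not read for this demo.

```python
# Hypothetical sketch of a dense step-reward rule with anti-gaming controls.
# Function name, bounds, and magnitudes are illustrative only.

def compute_step_reward(action: str, result_rows: int, seen_queries: set[str],
                        query: str = "") -> float:
    """Small positive reward for novel, useful exploration; near-zero otherwise."""
    reward = 0.0
    if action in ("DESCRIBE", "SAMPLE"):
        reward = 0.05                      # small signal for schema exploration
    elif action == "QUERY":
        if query in seen_queries:
            reward = 0.0                   # anti-gaming: exact repeats earn nothing
        else:
            seen_queries.add(query)
            reward = 0.1 if result_rows > 0 else 0.02
    # clamp so shaped step rewards can never dominate the terminal reward
    return max(-0.1, min(reward, 0.1))

seen: set[str] = set()
print(compute_step_reward("QUERY", 5, seen, "SELECT * FROM users"))  # 0.1
print(compute_step_reward("QUERY", 5, seen, "SELECT * FROM users"))  # 0.0 (repeat)
```

The key property this sketch demonstrates is that repeated or low-value behavior converges to zero marginal reward, which is what the anti-gaming controls are meant to enforce.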
What Is Already Proven
Verified in This Demo Run
- Happy-path SQL exploration smoke flow passes locally.
- Non-SELECT query error handling passes locally.
- Budget-exhaustion terminal reward behavior passes locally.
- Clamp boundary unit tests for step-reward floor/ceiling pass locally.
- Full smoke suite passes locally (25/25).
Previously Verified Evidence
- `specs/FEATURES.json` records verifier-approved evidence for F003: `uv run --with pytest pytest tests/ -v` with 166 passed.
- `specs/F003-IMPLEMENTATION_SPEC.md` (Section 7, Step 3.2) records final verification evidence and verifier approval.
- `specs/F003-VERIFICATION_SPEC.md` defines the unit/integration/e2e scenarios and edge-case checklist used for this demo plan.
What Still Needs User Verification
- Run a real episode manually (`reset` → DESCRIBE/SAMPLE/QUERY/ANSWER) and inspect the live `observation.reward` progression across steps.
- Confirm training-facing calibration in your own workload (random exploration ~0.1, targeted ~0.3, correct answer total ~1.3) under your runtime conditions.
Quickstart / Verification Steps
Run these commands to see the feature in action:
uv run --with pytest pytest tests/test_smoke.py -v -k "sample_and_query_success"
uv run --with pytest pytest tests/test_smoke.py -v -k "query_rejects_non_select"
uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward_clamp_upper or compute_reward_clamp_lower"
No extra setup was needed in this environment beyond project dependencies.
Live Local Proof
This feature is internal server-side reward logic (there is no direct end-user CLI command for reward computation itself), so the strongest truthful local proof is targeted runtime smoke/unit execution.
Run a happy-path exploration step flow
This validates a representative non-terminal exploration path.
uv run --with pytest pytest tests/test_smoke.py -v -k "sample_and_query_success"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpjnSgOs/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 25 items / 24 deselected / 1 selected
tests/test_smoke.py::TestEnvironment::test_sample_and_query_success PASSED [100%]
======================= 1 passed, 24 deselected in 3.79s =======================
Notice the targeted flow test passes, showing exploration/query behavior remains valid under dense reward integration.
Verify boundary clamping behavior
This checks upper/lower clamp boundaries for cumulative step rewards.
uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward_clamp_upper or compute_reward_clamp_lower"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmp91LChv/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 66 items / 64 deselected / 2 selected
tests/unit/test_reward.py::TestComputeStepReward::test_compute_reward_clamp_upper PASSED [ 50%]
tests/unit/test_reward.py::TestComputeStepReward::test_compute_reward_clamp_lower PASSED [100%]
======================= 2 passed, 64 deselected in 4.58s =======================
This confirms reward accumulation boundaries are enforced at both extremes.
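Conceptually, the clamp boundary checks assert behavior like the following sketch. The helper name and the floor/ceiling values are hypothetical stand-ins, not the project's actual constants.

```python
# Hypothetical clamp helper; bounds are illustrative, not the real F003 values.
STEP_REWARD_FLOOR, STEP_REWARD_CEILING = -1.0, 1.0

def clamp_reward(raw: float) -> float:
    """Bound a raw shaped reward to [floor, ceiling]."""
    return max(STEP_REWARD_FLOOR, min(raw, STEP_REWARD_CEILING))

assert clamp_reward(10.0) == STEP_REWARD_CEILING   # upper boundary enforced
assert clamp_reward(-10.0) == STEP_REWARD_FLOOR    # lower boundary enforced
assert clamp_reward(0.3) == 0.3                    # in-range values pass through
```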
Existing Evidence
- `specs/F003-IMPLEMENTATION_SPEC.md` Section 7 includes recorded per-slice evidence for Layer 1, Layer 2, integration wiring, and full-suite verification.
- `specs/FEATURES.json` includes approved verification evidence (`tests_run: 166`, `tests_passed: 166`).
Manual Verification Checklist
- Start a fresh episode and run one DESCRIBE action.
- Run at least two distinct QUERY actions, then repeat one exact query.
- Confirm the repeated query is rewarded less than first-time useful queries.
- Submit an invalid/non-SELECT query and confirm safe penalty behavior.
- End with ANSWER and verify the terminal reward still follows the correctness outcome.
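The checklist above can be scripted as an episode walk-through. Since the real environment API was not read for this demo, the sketch below substitutes a fake env that only mimics the reward shape described in the spec (small step rewards, zero for exact repeats, a safe penalty for non-SELECT queries, and a terminal reward on ANSWER); every class and method name here is hypothetical.

```python
# Hedged sketch of the manual checklist as a scripted episode.
# FakeSqlEnv is a stand-in, not the project's actual environment.
from dataclasses import dataclass

@dataclass
class Obs:
    reward: float
    done: bool

class FakeSqlEnv:
    def __init__(self) -> None:
        self.seen: set[str] = set()

    def reset(self) -> Obs:
        self.seen.clear()
        return Obs(0.0, False)

    def step(self, action: dict) -> Obs:
        kind = action["type"]
        if kind == "ANSWER":
            return Obs(1.0 if action.get("correct") else 0.0, True)
        if kind == "QUERY":
            sql = action["sql"].strip().upper()
            if not sql.startswith("SELECT"):
                return Obs(-0.05, False)   # safe penalty; episode continues
            if action["sql"] in self.seen:
                return Obs(0.0, False)     # exact repeat earns nothing
            self.seen.add(action["sql"])
            return Obs(0.1, False)
        return Obs(0.05, False)            # DESCRIBE / SAMPLE

env = FakeSqlEnv()
env.reset()
trace = [env.step(a).reward for a in [
    {"type": "DESCRIBE", "table": "users"},
    {"type": "QUERY", "sql": "SELECT COUNT(*) FROM users"},
    {"type": "QUERY", "sql": "SELECT COUNT(*) FROM users"},  # exact repeat
    {"type": "QUERY", "sql": "DROP TABLE users"},            # non-SELECT
    {"type": "ANSWER", "correct": True},
]]
print(trace)  # [0.05, 0.1, 0.0, -0.05, 1.0]
```

When you run the checklist against the real environment, the per-step reward progression you observe should follow this same qualitative shape, even if the exact magnitudes differ.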
Edge Cases Exercised
Invalid non-SELECT query is safely handled
uv run --with pytest pytest tests/test_smoke.py -v -k "query_rejects_non_select"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpitwmJ8/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 25 items / 24 deselected / 1 selected
tests/test_smoke.py::TestEnvironment::test_query_rejects_non_select PASSED [100%]
======================= 1 passed, 24 deselected in 4.04s =======================
This matters because SQL errors/unsafe query patterns should not break reward flow.
Budget exhaustion keeps terminal reward contract
uv run --with pytest pytest tests/test_smoke.py -v -k "budget_exhaustion_sets_done_and_zero_reward"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpRB9qch/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 25 items / 24 deselected / 1 selected
tests/test_smoke.py::TestEnvironment::test_budget_exhaustion_sets_done_and_zero_reward PASSED [100%]
======================= 1 passed, 24 deselected in 3.89s =======================
This matters because dense shaping must not corrupt terminal episode semantics.
Test Evidence (Optional)
Supplementary proof that the feature works correctly across broader scenarios.
| Test Suite | Tests | Status |
|---|---|---|
| Smoke suite (`tests/test_smoke.py`) | 25 | All passed |
Representative command:
uv run --with pytest pytest tests/test_smoke.py -v
[... full smoke output ...]
============================== 25 passed in 3.67s ==============================
Feature Links
- Implementation spec: `specs/F003-IMPLEMENTATION_SPEC.md`
- Verification spec: `specs/F003-VERIFICATION_SPEC.md`
Demo generated by feature-demo agent. Re-run with /feature-demo F003 to refresh.