
Feature Demo: F003 — Dense Reward System

Generated: 2026-03-28T06:07:34Z
Context source: spec + discovery only (implementation not read)
Feature entry: FEATURES.json #F003


What This Feature Does

Before this feature, agents only got a binary reward at the end of an episode, which made exploration hard to learn from. With F003, agents now get small, meaningful reward signals during non-terminal DESCRIBE/SAMPLE/QUERY steps, plus the final terminal correctness reward.

From the user's perspective, random exploration should produce low cumulative reward, targeted exploration should produce higher reward, and anti-gaming controls should prevent agents from farming reward through repeated or low-value actions.
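To make the anti-gaming idea concrete, here is a minimal sketch of how a first-time query could earn a small reward while an exact repeat does not. This demo did not read the implementation, so every name here (`ShapingState`, `shaped_step_reward`) and both constants are hypothetical, and the whitespace normalization is an assumption about what counts as a "repeat".

```python
from dataclasses import dataclass, field

# Hypothetical values for illustration only; the real constants live in the
# implementation, which this demo did not read.
STEP_BONUS = 0.05
REPEAT_PENALTY = -0.01


@dataclass
class ShapingState:
    """Per-episode memory of queries that have already been rewarded."""
    seen_queries: set[str] = field(default_factory=set)


def shaped_step_reward(state: ShapingState, sql: str) -> float:
    """Reward a first-time query; penalize an exact repeat."""
    key = " ".join(sql.split())  # assumption: collapse whitespace before comparing
    if key in state.seen_queries:
        return REPEAT_PENALTY
    state.seen_queries.add(key)
    return STEP_BONUS


state = ShapingState()
rewards = [
    shaped_step_reward(state, sql)
    for sql in (
        "SELECT name FROM users",
        "SELECT name  FROM users",  # same query once whitespace is collapsed
        "SELECT COUNT(*) FROM orders",
    )
]
print(rewards)
```

With these assumed constants, the second call earns less than the first because the query has already been seen, which is the property the anti-gaming controls are meant to guarantee.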


What Is Already Proven

Verified in This Demo Run

  • Happy-path SQL exploration smoke flow passes locally.
  • Non-SELECT query error handling passes locally.
  • Budget-exhaustion terminal reward behavior passes locally.
  • Clamp boundary unit tests for step-reward floor/ceiling pass locally.
  • Full smoke suite passes locally (25/25).

Previously Verified Evidence

  • specs/FEATURES.json records verifier-approved evidence for F003: uv run --with pytest pytest tests/ -v with 166 passed.
  • specs/F003-IMPLEMENTATION_SPEC.md (Section 7, Step 3.2) records final verification evidence and verifier approval.
  • specs/F003-VERIFICATION_SPEC.md defines unit/integration/e2e scenarios and edge-case checklist used for this demo plan.

What Still Needs User Verification

  • Run a real episode manually (reset, then DESCRIBE/SAMPLE/QUERY/ANSWER) and inspect how the live observation.reward value progresses across steps.
  • Confirm training-facing calibration in your own workload (random exploration ~0.1, targeted ~0.3, correct answer total ~1.3) under your runtime conditions.
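One way to check the calibration targets above against your own runs is to sum the per-step rewards of an episode and compare the total with the quoted target. The sketch below assumes you have already collected the `observation.reward` values into a list; the helper names and the tolerance of 0.05 are assumptions, not part of the spec.

```python
import math

# Targets quoted in this demo. The tolerance used below is an assumption.
CALIBRATION_TARGETS = {"random": 0.1, "targeted": 0.3, "correct_answer": 1.3}


def episode_total(step_rewards: list[float]) -> float:
    """Sum per-step rewards with reduced floating-point error."""
    return math.fsum(step_rewards)


def within_calibration(kind: str, step_rewards: list[float],
                       tolerance: float = 0.05) -> bool:
    """Return True if the episode total is close to the target for `kind`."""
    target = CALIBRATION_TARGETS[kind]
    return abs(episode_total(step_rewards) - target) <= tolerance
```

For example, `within_calibration("targeted", trace)` returns True when the rewards in `trace` sum to roughly 0.3. Choose a tolerance that reflects the variance you see across episodes in your own workload.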

Quickstart / Verification Steps

Run these commands to see the feature in action:

uv run --with pytest pytest tests/test_smoke.py -v -k "sample_and_query_success"
uv run --with pytest pytest tests/test_smoke.py -v -k "query_rejects_non_select"
uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward_clamp_upper or compute_reward_clamp_lower"

No extra setup was needed in this environment beyond project dependencies.


Live Local Proof

This feature is internal server-side reward logic with no direct end-user CLI command for reward computation, so the strongest truthful local proof is targeted smoke and unit test execution.

Run a happy-path exploration step flow

This validates a representative non-terminal exploration path.

uv run --with pytest pytest tests/test_smoke.py -v -k "sample_and_query_success"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpjnSgOs/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 25 items / 24 deselected / 1 selected

tests/test_smoke.py::TestEnvironment::test_sample_and_query_success PASSED [100%]

======================= 1 passed, 24 deselected in 3.79s =======================

Notice the targeted flow test passes, showing exploration/query behavior remains valid under dense reward integration.

Verify boundary clamping behavior

This checks upper/lower clamp boundaries for cumulative step rewards.

uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward_clamp_upper or compute_reward_clamp_lower"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmp91LChv/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 66 items / 64 deselected / 2 selected

tests/unit/test_reward.py::TestComputeStepReward::test_compute_reward_clamp_upper PASSED [ 50%]
tests/unit/test_reward.py::TestComputeStepReward::test_compute_reward_clamp_lower PASSED [100%]

======================= 2 passed, 64 deselected in 4.58s =======================

This confirms reward accumulation boundaries are enforced at both extremes.
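As an illustration of what such clamping could look like, the sketch below caps the running total of step rewards and returns only the portion of a step's reward that fits inside the bounds. The function name and the bounds of -0.25 and 0.5 are hypothetical; the actual floor and ceiling are defined in the implementation, which this demo did not read.

```python
def clamp_cumulative(previous_total: float, delta: float,
                     floor: float = -0.25, ceiling: float = 0.5) -> float:
    """Return the step reward actually granted.

    The running total is kept inside [floor, ceiling]; any part of `delta`
    that would push the total past a bound is discarded.
    """
    new_total = min(max(previous_total + delta, floor), ceiling)
    return new_total - previous_total


# A step that would overshoot the ceiling is trimmed to what still fits.
print(clamp_cumulative(0.25, 0.5))
# A large penalty is trimmed so the total stops at the floor.
print(clamp_cumulative(0.0, -1.0))
```

Clamping the running total rather than each individual step means a long run of small rewards cannot exceed the ceiling, which matches the anti-farming intent described earlier.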


Existing Evidence

  • specs/F003-IMPLEMENTATION_SPEC.md Section 7 includes recorded per-slice evidence for Layer 1, Layer 2, integration wiring, and full-suite verification.
  • specs/FEATURES.json includes approved verification evidence (tests_run: 166, tests_passed: 166).

Manual Verification Checklist

  1. Start a fresh episode and run one DESCRIBE action.
  2. Run at least two distinct QUERY actions, then repeat one exact query.
  3. Confirm repeat behavior is less rewarding than first-time useful queries.
  4. Submit an invalid/non-SELECT query and confirm safe penalty behavior.
  5. End with ANSWER and verify terminal reward still follows correctness outcome.
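If you record the rewards from the checklist above, a small helper can confirm step 3 mechanically. This is a sketch under assumptions: `Step` and `repeats_are_less_rewarding` are hypothetical names, a repeat is taken to mean an identical query string, and you copy the reward values from `observation.reward` yourself.

```python
from typing import NamedTuple


class Step(NamedTuple):
    action: str    # "DESCRIBE", "SAMPLE", "QUERY", or "ANSWER"
    argument: str  # table name, SQL text, or answer value
    reward: float  # copied by hand from observation.reward


def repeats_are_less_rewarding(trace: list[Step]) -> bool:
    """Check that every repeated QUERY earned less than its first occurrence."""
    first_reward: dict[str, float] = {}
    for step in trace:
        if step.action != "QUERY":
            continue
        if step.argument in first_reward:
            if step.reward >= first_reward[step.argument]:
                return False
        else:
            first_reward[step.argument] = step.reward
    return True
```

Build the `trace` list from your own episode. The helper returns True when no repeated query matched or beat the reward of its first run, and it ignores DESCRIBE, SAMPLE, and ANSWER steps.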

Edge Cases Exercised

Invalid non-SELECT query is safely handled

uv run --with pytest pytest tests/test_smoke.py -v -k "query_rejects_non_select"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpitwmJ8/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 25 items / 24 deselected / 1 selected

tests/test_smoke.py::TestEnvironment::test_query_rejects_non_select PASSED [100%]

======================= 1 passed, 24 deselected in 4.04s =======================

This matters because SQL errors/unsafe query patterns should not break reward flow.

Budget exhaustion keeps terminal reward contract

uv run --with pytest pytest tests/test_smoke.py -v -k "budget_exhaustion_sets_done_and_zero_reward"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/.cache/uv/builds-v0/.tmpRB9qch/bin/python
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F003-dense-reward-system
configfile: pyproject.toml
plugins: anyio-4.13.0
collecting ... collected 25 items / 24 deselected / 1 selected

tests/test_smoke.py::TestEnvironment::test_budget_exhaustion_sets_done_and_zero_reward PASSED [100%]

======================= 1 passed, 24 deselected in 3.89s =======================

This matters because dense shaping must not corrupt terminal episode semantics.
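The contract implied by the test name (budget exhaustion sets done and yields zero reward) can be sketched as follows. `BudgetedEpisode` is a hypothetical stand-in for illustration, not the environment's actual class, and it assumes the step that exhausts the budget is the one that returns zero.

```python
from dataclasses import dataclass


@dataclass
class BudgetedEpisode:
    """Toy episode with a fixed step budget (hypothetical)."""
    steps_remaining: int
    done: bool = False

    def step(self, shaped_reward: float) -> float:
        """Consume one step; return 0.0 and finish if the budget runs out."""
        if self.done:
            raise RuntimeError("episode already finished")
        self.steps_remaining -= 1
        if self.steps_remaining <= 0:
            # Terminal semantics win over shaping: no reward on exhaustion.
            self.done = True
            return 0.0
        return shaped_reward


episode = BudgetedEpisode(steps_remaining=2)
first = episode.step(0.05)   # budget remains, shaped reward passes through
second = episode.step(0.05)  # budget exhausted, reward forced to zero
print(first, second, episode.done)
```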


Test Evidence (Optional)

Supplementary proof that the feature works correctly across broader scenarios.

| Test Suite | Tests | Status |
|---|---|---|
| Smoke suite (tests/test_smoke.py) | 25 | All passed |

Representative command:

uv run --with pytest pytest tests/test_smoke.py -v
[... full smoke output ...]
============================== 25 passed in 3.67s ==============================

Feature Links

  • Implementation spec: specs/F003-IMPLEMENTATION_SPEC.md
  • Verification spec: specs/F003-VERIFICATION_SPEC.md

Demo generated by feature-demo agent. Re-run with /feature-demo F003 to refresh.