Research Summary
Project: SQLEnv
Change: F003 – Dense Reward System (3-layer reward architecture)
Date: 2026-03-27
Status: Draft
1. Change Overview
What We're Changing
Implement the 3-layer reward architecture in server/reward.py:
- Layer 1 (Operational): exec_ok +0.02, new_info +0.01 (capped 0.10), repeat -0.01, step_cost -0.005
- Layer 2 (Progress): Weighted average of cardinality matching + value overlap + numeric range proximity, binned to 5 levels, improvement-only, scaled by × 0.15
- Layer 3 (Terminal): +1.0 correct, 0.0 incorrect/timeout
Wire into step() so non-terminal steps return meaningful reward signals.
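A minimal sketch of how the three layers could compose inside step(). The layer helpers here are trivial stand-ins (the real ones are specified in Section 4b); only the composition order and the cumulative clamp to [-0.2, +0.5] are the point, and the dict-based ctx is a hypothetical stand-in for EpisodeContext:

```python
# Sketch only: layer helpers are placeholders; the real signals live in
# _layer1_operational / _layer2_progress (see Section 4b, Target Shape).
STEP_CLAMP_LO, STEP_CLAMP_HI = -0.2, 0.5

def _layer1_operational(ctx, action_type, rows, error):
    # Placeholder: step_cost always applies; exec_ok only on success.
    return -0.005 if error else 0.02 - 0.005

def _layer2_progress(ctx, rows):
    # Placeholder: real version compares rows against cached gold_rows.
    return 0.0

def compute_step_reward(ctx, action_type, rows, error):
    delta = _layer1_operational(ctx, action_type, rows, error)
    if action_type == "QUERY" and not error:
        delta += _layer2_progress(ctx, rows)
    # Clamp so the *running per-episode total* stays within [-0.2, +0.5],
    # not just the single-step delta.
    total = ctx["cumulative_reward"]
    delta = max(STEP_CLAMP_LO - total, min(delta, STEP_CLAMP_HI - total))
    ctx["cumulative_reward"] = total + delta
    return delta
```

Clamping the cumulative total (rather than each step independently) is what enforces the [-0.2, +0.5] success criterion above.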
Why We're Changing It
Currently all non-terminal steps return reward=None, so agents get no learning signal until ANSWER. Dense per-step rewards give GRPO a gradient signal throughout the episode, which is what lets training converge.
Success Criteria
- Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
- Anti-gaming: can't farm rewards by describing everything or repeating queries
- Progress signal coarsened (5 bins) to prevent reward hill-climbing
- Total step rewards clamped to [-0.2, +0.5]
2. System Context
Current Behavior
- server/reward.py is a docstring-only stub – all reward logic needs to be built from scratch
- step() returns reward=None for DESCRIBE/SAMPLE/QUERY actions
- _handle_answer() returns 1.0 or 0.0 – the only reward signal
- EpisodeContext tracks described_tables (set) and action_log (list), but no reward accumulators
Architecture Context
step(action)
├── DESCRIBE → _handle_describe() → result string
├── SAMPLE → _handle_sample() → result string
├── QUERY → _handle_query() → result string
└── ANSWER → _handle_answer() → (bool, reward)

After action execution (NEW):
reward.compute_step_reward(episode_ctx, action_type, query_rows, error)
├── Layer 1: operational signals
├── Layer 2: progress-to-target (QUERY only)
└── clamp to [-0.2, 0.5] running total
Entry Points
| Entry Point | Trigger | Current Flow |
|---|---|---|
| step() | Every agent action | Action dispatch → observation (reward=None) |
| compute_step_reward() | To be created – called from step() | Per-step reward from layers 1+2 |
Data Flow
| Data | Source | Shape/Type | Destination |
|---|---|---|---|
| Action type + result | step() dispatch | str, list[tuple] | Layer 1 |
| Query result rows | _execute_sql() | list[tuple] | Layer 2 progress |
| Gold result rows | _execute_gold_sql() at reset | list[tuple] | Layer 2 reference – must store in EpisodeContext |
| Described tables | EpisodeContext.described_tables | set[str] | Layer 1 new_info |
| Query hashes | Need to add to EpisodeContext | set[str] | Layer 1 repeat detection |
| Best progress | Need to add to EpisodeContext | float | Layer 2 improvement tracking |
| Cumulative reward | Need to add to EpisodeContext | float | Clamping |
Critical gap: EpisodeContext stores gold_answer as formatted string only. Layer 2 needs raw list[tuple] gold rows. Must add gold_rows: list[tuple] field and populate at reset().
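The gap above suggests additions along these lines. This is a hypothetical sketch, not the actual models.py: the existing fields are abbreviated, and the four proposed fields are proposals from the data-flow table, not existing code:

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeContext:
    # Existing fields (models.py) -- abbreviated sketch
    described_tables: set[str] = field(default_factory=set)
    action_log: list[str] = field(default_factory=list)
    gold_answer: str = ""  # formatted string only today

    # Proposed additions for F003
    gold_rows: list[tuple] = field(default_factory=list)  # raw rows, cached at reset()
    query_hashes: set[str] = field(default_factory=set)   # Layer 1 repeat detection
    best_progress: float = 0.0                            # Layer 2 improvement-only gate
    cumulative_reward: float = 0.0                        # clamped running step total
```

Populating gold_rows once at reset() also satisfies the "don't re-execute gold SQL per step" rule in Section 4b.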
3. Dependencies
Code We Depend On
| Dependency | What We Use | Risk if Changed |
|---|---|---|
| models.py:EpisodeContext | Episode state – needs new fields | Must add reward tracking fields |
| sql_environment.py:_execute_sql() | Returns list[tuple] for QUERY | Need raw rows passed to reward |
| sql_environment.py:_execute_gold_sql() | Returns list[tuple] at reset | Already returns raw rows – just store them |
| F002 (verifier.py) | Terminal correctness | Being built in parallel – Layer 3 can use a naive check initially |
Code That Depends On Us
| Dependent | How They Use Us | Impact of Our Change |
|---|---|---|
| sql_environment.py:step() | Calls compute_step_reward() | Must integrate into step flow |
| F006 (GRPO Training) | reward_funcs for TRL trainer | Components exposed as separate functions |
| tests/test_smoke.py | Asserts reward=None for non-ANSWER | Will break – tests need updating |
4. Risks & Edge Cases
Identified Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Reward hacking via progress signal | Medium | Agent exploits shaping | Coarsen to 5 bins, cap step rewards, small magnitudes |
| Test breakage | High | 25 existing tests | Update test assertions for non-None rewards |
| Gold rows unavailable | Low | Layer 2 can't compute | Fallback: Layer 1 only |
Edge Cases to Handle
| Edge Case | Current Behavior | Required Behavior |
|---|---|---|
| QUERY returns empty result | reward=None | Layer 1: exec_ok (+0.02); Layer 2: cardinality=0 |
| QUERY fails with SQL error | reward=None, error set | Layer 1: step_cost only (-0.005) |
| DESCRIBE same table twice | reward=None | Repeat penalty (-0.01), no new_info |
| Gold answer is empty | reward=None | Skip Layer 2, Layer 1 only |
| Budget exhausted without ANSWER | reward=0.0 | Terminal 0.0 + clamped step rewards |
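The error and repeat edge cases above can be sketched as Layer 1 logic. Signal magnitudes come from Section 1; everything else (the dict-based ctx, sha1 hashing of the SQL/table name, and the new_info_total accumulator) is a hypothetical shape, not existing code:

```python
import hashlib

EXEC_OK, NEW_INFO, REPEAT, STEP_COST = 0.02, 0.01, -0.01, -0.005
NEW_INFO_CAP = 0.10  # total new_info reward is capped per episode

def layer1_operational(ctx, action_type, sql_or_table, rows, error):
    """Operational signals only; ctx is a dict standing in for EpisodeContext."""
    delta = STEP_COST  # every step pays the cost, including failed SQL
    if error:
        return delta   # SQL error: step_cost only, no exec_ok
    delta += EXEC_OK   # an empty result set still counts as successful execution
    key = hashlib.sha1(sql_or_table.encode()).hexdigest()
    if key in ctx["query_hashes"]:
        delta += REPEAT  # repeated query / re-DESCRIBE: penalty, no new_info
    else:
        ctx["query_hashes"].add(key)
        if ctx["new_info_total"] < NEW_INFO_CAP:
            delta += NEW_INFO
            ctx["new_info_total"] += NEW_INFO
    return delta
```

Hashing the normalized SQL (or table name for DESCRIBE) keeps repeat detection O(1) per step.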
Invariants to Preserve
4b. Code Shape & Design Target
Existing Vocabulary
| Concept | Existing Name | Location |
|---|---|---|
| Episode state | EpisodeContext | models.py:135 |
| Described tables | described_tables: set[str] | models.py:143 |
| Action log | action_log: list[str] | models.py:144 |
Target Shape
| Component | Purpose | Why This Boundary |
|---|---|---|
| compute_step_reward(ctx, action_type, rows, error) | Main entry | Single public entry for step() |
| _layer1_operational(ctx, action_type, sql, rows, error) | Operational signals | Stateless except episode tracking |
| _layer2_progress(ctx, rows) | Progress-to-target (QUERY only) | Needs gold_rows comparison |
| _cardinality_score(pred_rows, gold_rows) | Row count comparison | Tier 1 metric |
| _value_overlap_score(pred_rows, gold_rows) | Jaccard set overlap | Tier 1 metric |
| _numeric_range_score(pred_rows, gold_rows) | Log-distance for numbers | Tier 1 metric |
| _bin_progress(raw_score) | Bin to {0, 0.25, 0.5, 0.75, 1.0} | Anti-gaming |
Abstraction Level
- Current level: Flat – server modules with plain functions
- Recommendation: Match the flat style – server/reward.py with plain functions
Anti-Patterns to Avoid
- Don't create reward strategy classes
- Don't add row-wise best match initially (add if training shows need)
- Don't import numpy/scipy – pure Python only
- Don't re-execute gold SQL per step – cache the rows at reset()
5. Constraints
Technical Constraints
| Constraint | Requirement | Notes |
|---|---|---|
| Performance | < 5 ms per reward computation | ~15 calls per episode |
| No heavy deps | Pure Python | No numpy/scipy |
| Deterministic | Same inputs → same reward | Required for reproducible training |
Testing Constraints
| Test Suite | Coverage Area | Notes |
|---|---|---|
| tests/test_smoke.py | 25 tests, some assert reward=None | Must update for non-None step rewards |
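The affected assertions would change roughly like this (illustrative only – the actual test names and helper in tests/test_smoke.py are assumptions):

```python
# Before F003, smoke tests asserted no signal on non-terminal steps:
#   assert obs.reward is None
# After F003, non-terminal steps carry a clamped float signal instead.
def check_step_reward(reward):
    assert reward is not None, "non-terminal steps must now carry a reward"
    assert isinstance(reward, float)
    assert -0.2 <= reward <= 0.5  # per-episode clamp bounds from Section 1
```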
6. Open Questions
| Question | Why It Matters | Who Can Answer |
|---|---|---|
| Layer 2 combination: weighted average (0.25/0.50/0.25) or adaptive? | Affects reward quality | Default: Method 1 per reward_design.md |
| Store gold_rows in EpisodeContext or a separate cache? | Design coupling | Recommend an EpisodeContext field |
7. Context Sources
| Source | Type | Notes |
|---|---|---|
| server/reward.py | Code (stub) | Docstring describes the 3-layer architecture |
| server/sql_environment.py | Code | step() flow, _execute_sql() |
| models.py:EpisodeContext | Code | Needs new reward-tracking fields |
| docs_draft/SQLEnv_Concept_v1.md Section 3 | Doc | Complete reward spec |
| docs_draft/reward_design.md | Doc | Distance metrics, combination methods |
| docs_draft/reward-research_gpt-5-2.md | Doc | Reward research |