
Research Summary

Project: SQLEnv
Change: F003 — Dense Reward System (3-layer reward architecture)
Date: 2026-03-27
Status: Draft


1. Change Overview

What We're Changing

Implement the 3-layer reward architecture in server/reward.py:

  • Layer 1 (Operational): exec_ok +0.02, new_info +0.01 (capped 0.10), repeat -0.01, step_cost -0.005
  • Layer 2 (Progress): Weighted average of cardinality matching + value overlap + numeric range proximity, binned to 5 levels, improvement-only, Γ—0.15
  • Layer 3 (Terminal): +1.0 correct, 0.0 incorrect/timeout

Wire into step() so non-terminal steps return meaningful reward signals.
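The per-step arithmetic above can be sketched as follows; the constant names and the `combine_layers` helper are illustrative assumptions, not the final `server/reward.py` API:

```python
# Illustrative reward constants taken from the spec above; names are assumptions.
EXEC_OK = 0.02          # Layer 1: action executed without error
NEW_INFO = 0.01         # Layer 1: first look at a table (capped at 0.10/episode)
NEW_INFO_CAP = 0.10
REPEAT = -0.01          # Layer 1: repeated query / re-described table
STEP_COST = -0.005      # Layer 1: flat per-step cost
PROGRESS_WEIGHT = 0.15  # Layer 2: binned progress is scaled by 0.15

def combine_layers(layer1: float, layer2_binned: float) -> float:
    """Per-step reward before the episode-level clamp (sketch)."""
    return layer1 + PROGRESS_WEIGHT * layer2_binned
```

A successful query that also hits the top progress bin would earn `(EXEC_OK + STEP_COST) + 0.15 × 1.0 = 0.165` for that step.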

Why We're Changing It

Currently all non-terminal steps return reward=None, so agents get no learning signal until ANSWER. Dense per-step rewards are needed for GRPO training to converge.

Success Criteria

  • Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
  • Anti-gaming: can't farm rewards by describing everything or repeating queries
  • Progress signal coarsened (5 bins) to prevent reward hill-climbing
  • Total step rewards clamped to [-0.2, +0.5]

2. System Context

Current Behavior

  • server/reward.py is a docstring-only stub β€” all reward logic needs to be built from scratch
  • step() returns reward=None for DESCRIBE/SAMPLE/QUERY actions
  • _handle_answer() returns 1.0 or 0.0 β€” the only reward signal
  • EpisodeContext tracks described_tables (set) and action_log (list) but no reward accumulators

Architecture Context

step(action)
  ├── DESCRIBE → _handle_describe() → result string
  ├── SAMPLE   → _handle_sample()   → result string
  ├── QUERY    → _handle_query()    → result string
  └── ANSWER   → _handle_answer()   → (bool, reward)

  After action execution (NEW):
  reward.compute_step_reward(episode_ctx, action_type, query_rows, error)
    ├── Layer 1: operational signals
    ├── Layer 2: progress-to-target (QUERY only)
    └── clamp to [-0.2, 0.5] running total
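A minimal sketch of the new wiring, assuming step() can see the raw rows and any error from dispatch. The function bodies here are trivial stand-ins; only the call shape mirrors the plan above:

```python
# Stand-in for the real Layer 1+2 computation; only the signature matters here.
def compute_step_reward(ctx, action_type, rows, error):
    if error is not None:
        return -0.005        # step_cost only on SQL error
    return 0.02 - 0.005      # exec_ok + step_cost (Layer 2 omitted in this sketch)

def step(ctx, action_type, rows=None, error=None):
    if action_type == "ANSWER":
        # Layer 3: terminal correctness (naive equality stand-in)
        return 1.0 if rows == ctx.get("gold_rows") else 0.0
    # NEW: dense per-step reward for DESCRIBE/SAMPLE/QUERY (Layers 1+2)
    return compute_step_reward(ctx, action_type, rows, error)
```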

Entry Points

| Entry Point | Trigger | Current Flow |
| --- | --- | --- |
| step() | Every agent action | Action dispatch → observation (reward=None) |
| compute_step_reward() | To be created — called from step() | Per-step reward from Layers 1+2 |

Data Flow

| Data | Source | Shape/Type | Destination |
| --- | --- | --- | --- |
| Action type + result | step() dispatch | str, list[tuple] | Layer 1 |
| Query result rows | _execute_sql() | list[tuple] | Layer 2 progress |
| Gold result rows | _execute_gold_sql() at reset | list[tuple] | Layer 2 reference — must store in EpisodeContext |
| Described tables | EpisodeContext.described_tables | set[str] | Layer 1 new_info |
| Query hashes | Need to add to EpisodeContext | set[str] | Layer 1 repeat detection |
| Best progress | Need to add to EpisodeContext | float | Layer 2 improvement tracking |
| Cumulative reward | Need to add to EpisodeContext | float | Clamping |

Critical gap: EpisodeContext stores gold_answer as a formatted string only. Layer 2 needs the raw list[tuple] gold rows, so we must add a gold_rows: list[tuple] field and populate it at reset().
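The missing fields could look like this (dataclass form assumed; the real EpisodeContext in models.py may be shaped differently):

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeContext:
    # Existing fields (per models.py)
    gold_answer: str = ""                                # formatted string only
    described_tables: set = field(default_factory=set)
    action_log: list = field(default_factory=list)
    # New fields needed for F003
    gold_rows: list = field(default_factory=list)        # raw gold rows, cached at reset()
    seen_query_hashes: set = field(default_factory=set)  # Layer 1 repeat detection
    best_progress: float = 0.0                           # Layer 2 improvement-only tracking
    cumulative_reward: float = 0.0                       # for the [-0.2, +0.5] clamp
```

Using `field(default_factory=...)` keeps the mutable containers per-episode rather than shared across instances.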


3. Dependencies

Code We Depend On

| Dependency | What We Use | Risk if Changed |
| --- | --- | --- |
| models.py:EpisodeContext | Episode state — needs new fields | Must add reward-tracking fields |
| sql_environment.py:_execute_sql() | Returns list[tuple] for QUERY | Need raw rows passed to reward |
| sql_environment.py:_execute_gold_sql() | Returns list[tuple] at reset | Already returns raw rows — just store them |
| F002 (verifier.py) | Terminal correctness | Being built in parallel — Layer 3 can use a naive check initially |

Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
| --- | --- | --- |
| sql_environment.py:step() | Calls compute_step_reward() | Must integrate into step flow |
| F006 (GRPO Training) | reward_funcs for TRL trainer | Components exposed as separate functions |
| tests/test_smoke.py | Asserts reward=None for non-ANSWER | Will break — tests need updating |

4. Risks & Edge Cases

Identified Risks

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Reward hacking via progress signal | Medium | Agent exploits shaping | Coarsen to 5 bins, cap step rewards, small magnitudes |
| Test breakage | High | 25 existing tests | Update test assertions for non-None rewards |
| Gold rows unavailable | Low | Layer 2 can't compute | Fall back to Layer 1 only |

Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
| --- | --- | --- |
| QUERY returns empty result | reward=None | Layer 1: exec_ok (+0.02); Layer 2: cardinality=0 |
| QUERY fails with SQL error | reward=None, error set | Layer 1: step_cost only (-0.005) |
| DESCRIBE same table twice | reward=None | Repeat penalty (-0.01), no new_info |
| Gold answer is empty | reward=None | Skip Layer 2; Layer 1 only |
| Budget exhausted without ANSWER | reward=0.0 | Terminal: 0.0 + clamped step rewards |
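The repeat-detection rows above suggest hashing a normalized form of each query; one possible sketch, where the whitespace/case normalization is an assumption about what counts as "the same query":

```python
import hashlib

def query_hash(sql: str) -> str:
    # Normalize case and whitespace so trivially reworded repeats still match.
    normalized = " ".join(sql.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_repeat(seen_hashes: set, sql: str) -> bool:
    # Record the hash on first sight; report True on any later sighting.
    h = query_hash(sql)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False
```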

Invariants to Preserve

  • Terminal correctness always dominates β€” correct answer β‰₯ 1.0
  • Step rewards clamped to [-0.2, +0.5] total
  • Reward is deterministic given same episode state

4b. Code Shape & Design Target

Existing Vocabulary

| Concept | Existing Name | Location |
| --- | --- | --- |
| Episode state | EpisodeContext | models.py:135 |
| Described tables | described_tables: set[str] | models.py:143 |
| Action log | action_log: list[str] | models.py:144 |

Target Shape

| Component | Purpose | Why This Boundary |
| --- | --- | --- |
| compute_step_reward(ctx, action_type, rows, error) | Main entry | Single public entry for step() |
| _layer1_operational(ctx, action_type, sql, rows, error) | Operational signals | Stateless except episode tracking |
| _layer2_progress(ctx, rows) | Progress-to-target (QUERY only) | Needs gold_rows comparison |
| _cardinality_score(pred_rows, gold_rows) | Row-count comparison | Tier 1 metric |
| _value_overlap_score(pred_rows, gold_rows) | Jaccard set overlap | Tier 1 metric |
| _numeric_range_score(pred_rows, gold_rows) | Log-distance for numbers | Tier 1 metric |
| _bin_progress(raw_score) | Bin to {0, 0.25, 0.5, 0.75, 1.0} | Anti-gaming |

Abstraction Level

  • Current level: Flat β€” server modules with plain functions
  • Recommendation: Match flat style. server/reward.py with plain functions.

Anti-Patterns to Avoid

  • Don't create reward strategy classes
  • Don't add row-wise best match initially (add if training shows need)
  • Don't import numpy/scipy β€” pure Python
  • Don't re-execute gold SQL per step β€” cache at reset()

5. Constraints

Technical Constraints

| Constraint | Requirement | Notes |
| --- | --- | --- |
| Performance | < 5 ms per reward computation | ~15 calls per episode |
| No heavy deps | Pure Python | No numpy/scipy |
| Deterministic | Same inputs → same reward | Required for reproducible training |

Testing Constraints

| Test Suite | Coverage Area | Notes |
| --- | --- | --- |
| tests/test_smoke.py | 25 tests, some assert reward=None | Must update for non-None step rewards |

6. Open Questions

| Question | Why It Matters | Who Can Answer |
| --- | --- | --- |
| Layer 2 combination: weighted average (0.25/0.50/0.25) or adaptive? | Affects reward quality | Default: Method 1 per reward_design.md |
| Store gold_rows in EpisodeContext or a separate cache? | Design coupling | Recommend an EpisodeContext field |

7. Context Sources

| Source | Type | Notes |
| --- | --- | --- |
| server/reward.py | Code (stub) | Docstring describes 3-layer architecture |
| server/sql_environment.py | Code | step() flow, _execute_sql() |
| models.py:EpisodeContext | Code | Needs new reward-tracking fields |
| docs_draft/SQLEnv_Concept_v1.md Section 3 | Doc | Complete reward spec |
| docs_draft/reward_design.md | Doc | Distance metrics, combination methods |
| docs_draft/reward-research_gpt-5-2.md | Doc | Reward research |