---
title: Reward Shaping Research
description: >-
  Theoretical basis for SQLEnv dense reward architecture, comparing
  cumulative-cap vs delta-based progress approaches grounded in potential-based
  shaping theory
doc_type: explanation
---

# Reward Shaping Research

Last updated: 2026-03-29

Research notes on dense reward design for SQLEnv. Documents the theoretical basis for our reward architecture, the problems with the original cumulative-cap design, and the rationale for switching to per-step clipping with delta-based progress.

## Problem Statement

SQLEnv's original reward system used:

  1. Cumulative tracking with a hard cap at [-0.2, 0.5]
  2. Improvement-only gating (reward only when binned_progress > best_progress)

Both violate established RL theory and create practical training problems.

## Potential-Based Reward Shaping (Ng et al., 1999)

Paper: Ng, A. Y., Harada, D., & Russell, S. (1999). "Policy invariance under reward transformations: Theory and application to reward shaping." ICML.

Core theorem: Given an MDP M with reward R, define shaped reward R' = R + F. The optimal policy is preserved if and only if F has the form:

F(s, a, s') = γ · Φ(s') − Φ(s)

where Φ: S → ℝ is a potential function and γ is the discount factor.

**Why this matters for SQLEnv:**

  • Cumulative capping is NOT potential-based (the shaping reward depends on trajectory history, not just state transitions)
  • Non-potential-based shaping changes the optimal policy in unpredictable ways
  • Agents may optimize for shaped reward rather than task completion

**Delta-from-previous IS potential-based with γ=1:**

F(s, s') = Φ(s') − Φ(s)  where Φ(s) = binned_progress(s)

This form provably preserves the optimal policy.
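The policy-invariance property has a concrete numerical consequence: with γ=1 the per-step deltas telescope, so the total shaping reward depends only on the first and last potentials, never on the path taken. A minimal sketch (illustrative potential values, not SQLEnv's actual progress scores):

```python
# Potential-based shaping with gamma = 1: F(s, s') = phi(s') - phi(s).
# The per-step deltas telescope, so total shaping reward depends only on
# the start and end potentials -- not on the trajectory in between.

def shaping_rewards(potentials):
    """Per-step shaping rewards for a trajectory of potentials Phi(s_t)."""
    return [potentials[t + 1] - potentials[t] for t in range(len(potentials) - 1)]

# Two trajectories with the same endpoints but different paths:
direct = [0.0, 0.5, 1.0]
detour = [0.0, 0.75, 0.25, 0.75, 1.0]

# Both sum to Phi(end) - Phi(start) = 1.0, so the agent cannot farm
# shaping reward by taking detours.
assert sum(shaping_rewards(direct)) == sum(shaping_rewards(detour))
```

This telescoping is exactly what breaks under cumulative capping: once a history-dependent cap zeroes some deltas, the sum no longer depends only on the endpoints.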

## Why Cumulative Caps Are Harmful

### The POMDP problem

The cumulative reward counter is part of the environment's hidden state. The agent cannot observe it. This means:

  • The same (observation, action) pair can yield different rewards depending on hidden history
  • The agent cannot learn when shaping signal will stop
  • Credit assignment breaks: "low reward because bad action" vs "low reward because cap hit"

### Early saturation

Once cumulative reward hits 0.5, all subsequent steps return zero shaping. For a 15-step episode, if the cap hits at step 5, steps 6-15 carry no shaping signal: the agent receives no gradient for the final two-thirds of the episode.
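The saturation effect is easy to reproduce. A hypothetical sketch (the cap range matches the old design; the per-step reward stream is illustrative):

```python
# Old design: cumulative shaping capped to [-0.2, 0.5]. Once the running
# total reaches the cap, every later step's shaping reward is zeroed out,
# and the agent cannot observe the hidden counter that caused it.

def capped_rewards(raw_rewards, cap=(-0.2, 0.5)):
    emitted, total = [], 0.0
    for r in raw_rewards:
        # Clip each step so the running total stays inside the cap range.
        clipped = max(cap[0] - total, min(cap[1] - total, r))
        emitted.append(clipped)
        total += clipped
    return emitted

# An episode with strong early progress: +0.1 shaping per step.
out = capped_rewards([0.1] * 15)
# The first 5 steps pass through (running total reaches 0.5); the
# remaining 10 steps emit exactly 0.0 -- no learning signal at all.
```

Note that the same action (+0.1 raw shaping) yields 0.1 or 0.0 depending purely on hidden history, which is the POMDP problem described above.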

### The cap was redundant

With a 15-step budget and -0.005 step cost:

  • Max possible L1 reward per step: +0.025 (exec_ok + new_info - step_cost)
  • Max over 15 steps: 0.375
  • Realistic total (mixed actions): ~0.15-0.25
  • Terminal reward: 1.0

The 4-7x ratio between the terminal reward and a realistic exploration total already makes farming exploration irrational, even without a cap.
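The budget arithmetic above can be checked directly (a sketch using only the constants stated in this document):

```python
# Reward-budget arithmetic for a 15-step episode (old design, no cap).
exec_ok, new_info, step_cost = 0.02, 0.01, 0.005

max_l1_per_step = exec_ok + new_info - step_cost   # best case: +0.025
max_l1_total = 15 * max_l1_per_step                # 0.375 over a full episode
terminal = 1.0

# Against a realistic exploration total of 0.15-0.25, the terminal
# reward is roughly 4-7x larger, so exploration farming never pays.
ratio_low, ratio_high = terminal / 0.25, terminal / 0.15
```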

## Why Improvement-Only Gating Blocks Learning

### No recovery signal

If the agent achieves progress 0.75 on step 3, then regresses to 0.25 on step 4, then recovers to 0.75 on step 5:

  • Old design: Steps 4 and 5 both get zero reward (0.25 < 0.75, 0.75 ≤ 0.75)
  • Delta design: Step 4 gets -0.075 (regression), step 5 gets +0.075 (recovery)

The delta design gives the agent information about what happened. The old design is silent.
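The regress-and-recover trajectory above can be worked through with the L2 scale factor of 0.15 (a sketch; function and variable names are illustrative):

```python
# Delta-based L2 progress reward: each QUERY step is rewarded for the
# change in binned progress since the previous step, scaled by 0.15.

L2_SCALE = 0.15

def l2_rewards(binned_progress, initial=0.0):
    """Per-step L2 rewards for a sequence of binned progress values."""
    prev, rewards = initial, []
    for p in binned_progress:
        rewards.append((p - prev) * L2_SCALE)
        prev = p
    return rewards

# Steps 3-5 from the example: reach 0.75, regress to 0.25, recover.
rewards = l2_rewards([0.75, 0.25, 0.75])
# Regression is penalized (-0.075) and recovery is rewarded (+0.075);
# under improvement-only gating both steps would have been silent.
```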

### Discourages experimentation

With improvement-only gating, once an agent achieves good progress, any experimental query that might regress temporarily is pure risk (no upside if it doesn't exceed the best). This discourages the kind of exploratory behavior the environment is designed to train.

## Current Design: Per-Step Clipping + Delta Progress

### Per-step reward structure

```
L1 Operational (every step):
  +0.02  exec_ok (no error)
  +0.01  new_info (unique SQL hash)
  -0.01  repeat penalty (same SQL)
  -0.005 step cost

L2 Progress (QUERY only):
  Weighted score: cardinality (25%) + value overlap (50%) + numeric range (25%)
  Binned to {0, 0.25, 0.5, 0.75, 1.0}
  Delta = binned - previous_progress
  Reward = delta * 0.15

L3 Terminal (ANSWER only):
  +1.0 correct, 0.0 wrong

Per-step clip: [-0.05, 0.15]
No cumulative tracking.
```
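A sketch of the full per-step computation under this design (function signature and flag names are illustrative, not SQLEnv's actual API; the sketch assumes new_info and the repeat penalty are mutually exclusive):

```python
# Per-step reward: L1 operational + L2 delta progress, clipped per step.
# No cumulative tracking -- each step is scored independently.

def step_reward(exec_ok, is_new_sql, is_repeat,
                binned_progress, prev_progress, is_query):
    r = -0.005                       # step cost, always applied
    if exec_ok:
        r += 0.02                    # query executed without error
    if is_new_sql:
        r += 0.01                    # unique SQL hash
    elif is_repeat:
        r -= 0.01                    # same SQL submitted again
    if is_query:                     # L2 applies to QUERY actions only
        r += (binned_progress - prev_progress) * 0.15
    return max(-0.05, min(0.15, r))  # per-step clip, no episode state
```

A best-case non-QUERY step yields exec_ok + new_info - step_cost = +0.025, matching the budget arithmetic above; a full 0 → 1.0 progress jump would total +0.175 and is clipped to +0.15.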

### Anti-farming mechanisms

| Mechanism | What it prevents |
|---|---|
| Hard budget (15 steps) | Infinite exploration |
| Step cost (-0.005) | Idle steps |
| Repeat penalty (-0.01) | Query farming |
| Terminal dominance (1.0 vs ~0.3 max) | Exploration over answering |
| Per-step clip (0.15 max) | Single-step reward spikes |

## Comparison of Progress Approaches

| Approach | Recovery signal | Farming risk | Theory |
|---|---|---|---|
| Improvement-only (old) | None | None | No formal guarantee |
| Absolute quality each step | Yes | High (repeat good queries) | None |
| Delta from previous step | Yes | Low | Potential-based (Ng 1999), provably policy-invariant |

## GRPO Integration

GRPO was designed for episode-level rewards (DeepSeek-R1). Dense per-step rewards are aggregated to a single episode scalar for GRPO's advantage computation.

"GRPO is Secretly a Process Reward Model" (Sullivan et al., 2025/2026, ICLR 2026) proved that GRPO implicitly performs process-level credit assignment when completions share prefixes. They identified a flaw (non-uniform step distribution) and proposed lambda-GRPO.

For SQLEnv: Dense rewards shape rollout behavior within each episode, but get aggregated to episode-level for GRPO. Weight terminal correctness heavily: ~1.0 correctness + 0.3 progress + 0.1 operational.
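A sketch of that episode-level aggregation and GRPO's group-relative advantage (the 1.0/0.3/0.1 weights are from this document; the function shapes are assumptions, not SQLEnv's actual trainer code):

```python
# Collapse dense per-step rewards into one episode scalar for GRPO's
# advantage computation. Terminal correctness dominates the sum.

def episode_scalar(correct, progress_rewards, operational_rewards):
    return (1.0 * float(correct)
            + 0.3 * sum(progress_rewards)
            + 0.1 * sum(operational_rewards))

def grpo_advantages(scalars, eps=1e-8):
    """Group-relative advantages over G rollouts of the same prompt."""
    mu = sum(scalars) / len(scalars)
    var = sum((s - mu) ** 2 for s in scalars) / len(scalars)
    return [(s - mu) / (var ** 0.5 + eps) for s in scalars]
```

Because the scalar is standardized within the group, the absolute scale of the dense components matters less than their relative ordering across rollouts; the weights mainly control how often exploration quality can outrank correctness.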

Relevant validation:

  • TIPS (March 2026): potential-based turn-level shaping for multi-turn LLM agents, 11.8% EM improvement over PPO/GRPO baselines
  • ToolRL (2025): finer-grained reward decomposition leads to 17% improvement over base models with GRPO
  • StepTool (2024): step-grained reward shaping significantly outperformed outcome-only for tool learning

## Future Directions

  1. Diminishing novelty bonuses: reward = 0.01 / (1 + 0.5 * exploration_count) instead of flat +0.01 per unique query. Classic count-based exploration (Bellemare et al. 2016, Never Give Up) naturally tapers.

  2. Curriculum on step budget: Start with generous budget (20 steps) for easy questions, tighten to 10 for hard ones as training progresses.

  3. Per-layer independent clipping: Clip L1 and L2 separately rather than their sum, preventing one layer from consuming the other's budget.

  4. Lambda-GRPO: Apply Sullivan et al.'s fix for non-uniform step distribution to improve credit assignment across steps.

  5. Adaptive Length Penalty (ALP): From "Just Enough Thinking" (2026): per-prompt length penalties based on solve rate. Could adapt step budget per difficulty level.
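The first direction, diminishing novelty bonuses, can be sketched directly from the formula above (keying the counter on the SQL hash is an assumption; the class name is hypothetical):

```python
# Diminishing novelty bonus: the first occurrence of a query (by hash)
# earns the full base bonus; repeats taper toward zero, as in
# count-based exploration.

from collections import defaultdict

class NoveltyBonus:
    def __init__(self, base=0.01, decay=0.5):
        self.base, self.decay = base, decay
        self.counts = defaultdict(int)

    def __call__(self, sql_hash):
        # reward = base / (1 + decay * exploration_count)
        bonus = self.base / (1 + self.decay * self.counts[sql_hash])
        self.counts[sql_hash] += 1
        return bonus
```

Unlike the flat +0.01, this makes re-running near-duplicate queries progressively less attractive without requiring an explicit repeat penalty to do all the work.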

## Why Result-Based, Not SQL-Structure-Based

A natural question: why compare query results to the gold results, rather than comparing the SQL structure to the gold SQL?

### The equivalence problem

Multiple SQL queries produce identical results:

```sql
SELECT name FROM employees WHERE dept_id IN (SELECT id FROM departments WHERE location = 'NYC')
SELECT e.name FROM employees e JOIN departments d ON e.dept_id = d.id WHERE d.location = 'NYC'
SELECT name FROM employees WHERE EXISTS (SELECT 1 FROM departments WHERE id = employees.dept_id AND location = 'NYC')
```

Rewarding structural similarity to one gold query penalizes valid alternatives. This creates false negative gradient signal that hurts training.
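The equivalence point can be demonstrated end to end. A minimal sketch using an in-memory SQLite database (table data and the order-insensitive comparison are illustrative; SQLEnv's actual scorer also weights cardinality and numeric ranges):

```python
# Run two structurally different but semantically equivalent queries
# against the same schema and compare their result sets.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE departments (id INTEGER, location TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'NYC'), (2, 'SF');
    INSERT INTO employees VALUES ('Ada', 1), ('Grace', 2), ('Edsger', 1);
""")

q_subquery = ("SELECT name FROM employees WHERE dept_id IN "
              "(SELECT id FROM departments WHERE location = 'NYC')")
q_join = ("SELECT e.name FROM employees e JOIN departments d "
          "ON e.dept_id = d.id WHERE d.location = 'NYC'")

def result_set(query):
    """Order-insensitive view of a query's rows."""
    return sorted(con.execute(query).fetchall())

# Different SQL text, identical results: a result-based reward scores
# both queries equally, while structural similarity to one gold query
# would penalize the other.
assert result_set(q_subquery) == result_set(q_join)
```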

### The field moved away from structural comparison

Spider (Yu et al., 2018) used exact set match (decompose SQL into component sets). BIRD (Li et al., 2023) replaced it with execution accuracy, explicitly arguing that "exact match is too strict and penalizes valid alternative SQL formulations." Every recent system (DAIL-SQL, MAC-SQL, CHESS) evaluates on execution accuracy.

### Intermediate queries aren't meant to look like the gold

In our POMDP, the agent runs exploratory queries (`SELECT * FROM t LIMIT 5`, `SELECT COUNT(*)`) to gather information. These should look nothing like the gold query. Rewarding structural similarity would push the agent toward exploitation before it has explored enough.

### Result comparison is the right signal

| Dimension | Result-based | SQL-structure-based |
|---|---|---|
| Handles SQL equivalence | Yes | No |
| Correlates with true objective | Directly | Indirectly (proxy) |
| Works for exploratory queries | Yes | No (penalizes exploration) |
| Literature support | Strong (BIRD, CodeRL, LEVER) | Declining (Spider exact match being replaced) |

### What about SQL validity rewards?

One structural signal IS worth using: penalizing queries that fail to execute (syntax errors, missing tables/columns). This is not SQL similarity but SQL validity. We already capture it via the L1 operational rewards: `exec_ok` (+0.02) versus error (-0.005 step cost only). This accelerates learning without biasing the agent toward a specific solution path.

## References