title: Reward Shaping Research
description: >-
Theoretical basis for SQLEnv dense reward architecture, comparing
cumulative-cap vs delta-based progress approaches grounded in potential-based
shaping theory
doc_type: explanation
Reward Shaping Research
Last updated: 2026-03-29
Research notes on dense reward design for SQLEnv. Documents the theoretical basis for our reward architecture, the problems with the original cumulative-cap design, and the rationale for switching to per-step clipping with delta-based progress.
Problem Statement
SQLEnv's original reward system used:
- Cumulative tracking with a hard cap at [-0.2, 0.5]
- Improvement-only gating (reward only when binned_progress > best_progress)
Both violate established RL theory and create practical training problems.
Potential-Based Reward Shaping (Ng et al., 1999)
Paper: Ng, A. Y., Harada, D., & Russell, S. (1999). "Policy invariance under reward transformations: Theory and application to reward shaping." ICML.
Core theorem: Given an MDP M with reward R, define shaped reward R' = R + F. The optimal policy is preserved if and only if F has the form:
F(s, a, s') = γ · Φ(s') − Φ(s)
where Φ: S → ℝ is a potential function and γ is the discount factor.
Why this matters for SQLEnv:
- Cumulative capping is NOT potential-based (the shaping reward depends on trajectory history, not just state transitions)
- Non-potential-based shaping changes the optimal policy in unpredictable ways
- Agents may optimize for shaped reward rather than task completion
Delta-from-previous IS potential-based with γ=1:
F(s, s') = Φ(s') − Φ(s) where Φ(s) = binned_progress(s)
This form provably preserves the optimal policy.
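With γ = 1 the shaping terms telescope: summed over any trajectory, the total shaping equals Φ(s_T) − Φ(s_0), independent of the path taken, which is the intuition behind the policy-invariance guarantee. A minimal numeric check (the potential values here are illustrative):

```python
# Potential-based shaping with gamma = 1: F(s, s') = phi(s') - phi(s).
# Summed over a trajectory, the deltas telescope to phi(s_T) - phi(s_0),
# so two paths between the same endpoints collect identical total shaping.

def total_shaping(potentials):
    """Sum of per-step shaping deltas along a trajectory of potentials."""
    return sum(b - a for a, b in zip(potentials, potentials[1:]))

# Two different exploration paths from progress 0.0 to progress 1.0.
direct = [0.0, 0.5, 1.0]
wandering = [0.0, 0.75, 0.25, 0.5, 1.0]

assert abs(total_shaping(direct) - 1.0) < 1e-12
assert abs(total_shaping(wandering) - 1.0) < 1e-12
```

Because the total is path-independent, an agent cannot farm extra reward by wandering; only reaching a higher-potential final state pays.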
Why Cumulative Caps Are Harmful
The POMDP problem
The cumulative reward counter is part of the environment's hidden state. The agent cannot observe it. This means:
- The same (observation, action) pair can yield different rewards depending on hidden history
- The agent cannot learn when shaping signal will stop
- Credit assignment breaks: "low reward because bad action" vs "low reward because cap hit"
Early saturation
Once cumulative reward hits 0.5, all subsequent steps return zero shaping. For a 15-step episode, if the cap hits at step 5, steps 6-15 carry no shaping signal: the agent receives no gradient for two-thirds of the episode.
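The dead zone is easy to see in a simulation. A sketch, assuming an illustrative flat +0.1 shaping signal per step against the 0.5 cap:

```python
# Cumulative-cap shaping: once the running total reaches the cap,
# every later step returns zero, whatever the agent does.
CAP = 0.5

def capped_rewards(raw, cap=CAP):
    out, total = [], 0.0
    for r in raw:
        granted = max(min(r, cap - total), 0.0)  # clip to remaining headroom
        total += granted
        out.append(granted)
    return out

rewards = capped_rewards([0.1] * 15)
assert all(r > 0.09 for r in rewards[:5])   # full signal while under the cap
assert all(r < 1e-9 for r in rewards[5:])   # no learning signal afterward
```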
The cap was redundant
With a 15-step budget and -0.005 step cost:
- Max possible L1 reward per step: +0.025 (exec_ok + new_info - step_cost)
- Max over 15 steps: 0.375
- Realistic total (mixed actions): ~0.15-0.25
- Terminal reward: 1.0
The 4-7x ratio between terminal and exploration reward already makes farming exploration irrational; no cap is needed.
Why Improvement-Only Gating Blocks Learning
No recovery signal
If the agent achieves progress 0.75 on step 3, then regresses to 0.25 on step 4, then recovers to 0.75 on step 5:
- Old design: Steps 4 and 5 both get zero reward (0.25 < 0.75, 0.75 ≤ 0.75)
- Delta design: Step 4 gets -0.075 (regression), step 5 gets +0.075 (recovery)
The delta design gives the agent information about what happened. The old design is silent.
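The trajectory above can be replayed under both designs. A sketch using the 0.15 progress weight from the current design (function names are illustrative, not SQLEnv's API):

```python
# Progress trajectory from the example: 0.75 on step 3, regress to 0.25,
# recover to 0.75. Compare improvement-only gating against delta shaping.
PROGRESS_WEIGHT = 0.15

def gated_rewards(progress):
    """Old design: reward only strictly-new best progress."""
    best, out = 0.0, []
    for p in progress:
        out.append(PROGRESS_WEIGHT * (p - best) if p > best else 0.0)
        best = max(best, p)
    return out

def delta_rewards(progress, prev=0.0):
    """New design: reward the signed change from the previous step."""
    out = []
    for p in progress:
        out.append(PROGRESS_WEIGHT * (p - prev))
        prev = p
    return out

traj = [0.75, 0.25, 0.75]
gated = gated_rewards(traj)   # ~[+0.1125, 0.0, 0.0]
delta = delta_rewards(traj)   # ~[+0.1125, -0.075, +0.075]
assert gated[1] == gated[2] == 0.0   # silent on both regression and recovery
assert delta[1] < 0 < delta[2]       # both events are visible to the agent
```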
Discourages experimentation
With improvement-only gating, once an agent achieves good progress, any experimental query that might regress temporarily is pure risk (no upside if it doesn't exceed the best). This discourages the kind of exploratory behavior the environment is designed to train.
Current Design: Per-Step Clipping + Delta Progress
Per-step reward structure
L1 Operational (every step):
- +0.02 exec_ok (no error)
- +0.01 new_info (unique SQL hash)
- -0.01 repeat penalty (same SQL)
- -0.005 step cost

L2 Progress (QUERY only):
- Weighted score: cardinality (25%) + value overlap (50%) + numeric range (25%)
- Binned to {0, 0.25, 0.5, 0.75, 1.0}
- Delta = binned - previous_progress
- Reward = delta * 0.15

L3 Terminal (ANSWER only):
- +1.0 correct, 0.0 wrong

Per-step clip: [-0.05, 0.15]. No cumulative tracking.
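The full per-step computation can be sketched as follows. All names are illustrative rather than SQLEnv's actual API, and I assume the clip applies only to non-terminal steps, since the +1.0 terminal reward exceeds the 0.15 cap:

```python
# Sketch of the per-step reward described above. Names (exec_ok, is_repeat,
# bin_progress, ...) are illustrative, not the actual SQLEnv interface.
CLIP_LO, CLIP_HI = -0.05, 0.15

def bin_progress(score):
    """Snap a [0, 1] weighted score to the nearest quarter."""
    return round(score * 4) / 4

def step_reward(exec_ok, is_repeat, is_new_info, action,
                progress_score=0.0, prev_progress=0.0, answer_correct=False):
    r = -0.005                      # step cost, always
    if exec_ok:
        r += 0.02                   # L1: query executed without error
    if is_new_info:
        r += 0.01                   # L1: unique SQL hash
    if is_repeat:
        r -= 0.01                   # L1: same SQL as before
    if action == "QUERY":           # L2: delta-based progress shaping
        r += 0.15 * (bin_progress(progress_score) - prev_progress)
    if action == "ANSWER" and answer_correct:
        r += 1.0                    # L3: terminal reward
    if action != "ANSWER":          # clip shaping steps only (assumption)
        r = max(CLIP_LO, min(CLIP_HI, r))
    return r

# A clean, novel query with unchanged progress earns the L1 signal:
assert abs(step_reward(True, False, True, "QUERY", 0.5, 0.5) - 0.025) < 1e-12
# A 0.0 -> 1.0 progress jump would pay 0.175; the clip caps it at 0.15:
assert step_reward(True, False, True, "QUERY", 1.0, 0.0) == 0.15
```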
Anti-farming mechanisms
| Mechanism | What it prevents |
|---|---|
| Hard budget (15 steps) | Infinite exploration |
| Step cost (-0.005) | Idle steps |
| Repeat penalty (-0.01) | Query farming |
| Terminal dominance (1.0 vs ~0.3 max) | Exploration over answering |
| Per-step clip (0.15 max) | Single-step reward spikes |
Comparison of Progress Approaches
| Approach | Recovery signal | Farming risk | Theory |
|---|---|---|---|
| Improvement-only (old) | None | None | No formal guarantee |
| Absolute quality each step | Yes | High (repeat good queries) | None |
| Delta from previous step | Yes | Low | Potential-based (Ng 1999), provably policy-invariant |
GRPO Integration
GRPO was designed for episode-level rewards (Shao et al., 2024; DeepSeek-R1). Dense per-step rewards are therefore aggregated to a single episode scalar for GRPO's advantage computation.
"GRPO is Secretly a Process Reward Model" (Sullivan et al., 2025/2026, ICLR 2026) proved that GRPO implicitly performs process-level credit assignment when completions share prefixes. They identified a flaw (non-uniform step distribution) and proposed lambda-GRPO.
For SQLEnv: Dense rewards shape rollout behavior within each episode, but get aggregated to episode-level for GRPO. Weight terminal correctness heavily: ~1.0 correctness + 0.3 progress + 0.1 operational.
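One plausible reading of that weighting is a weighted sum over the episode's per-layer totals; the field names and the interpretation of the 1.0/0.3/0.1 coefficients are assumptions here:

```python
# Hypothetical aggregation of dense per-step rewards into the single
# episode scalar that GRPO's group-relative advantage computation needs.

def episode_scalar(steps, correct):
    """steps: list of dicts with per-step 'progress' and 'operational' rewards."""
    progress = sum(s["progress"] for s in steps)
    operational = sum(s["operational"] for s in steps)
    return 1.0 * float(correct) + 0.3 * progress + 0.1 * operational

# A correct episode: terminal correctness dominates the shaping terms.
steps = [{"progress": 0.075, "operational": 0.025},
         {"progress": 0.0375, "operational": 0.025}]
score = episode_scalar(steps, correct=True)
assert score > 1.0
```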
Relevant validation:
- TIPS (March 2026): potential-based turn-level shaping for multi-turn LLM agents, 11.8% EM improvement over PPO/GRPO baselines
- ToolRL (2025): finer-grained reward decomposition leads to 17% improvement over base models with GRPO
- StepTool (2024): step-grained reward shaping significantly outperformed outcome-only for tool learning
Future Directions
Diminishing novelty bonuses: reward = 0.01 / (1 + 0.5 * exploration_count) instead of a flat +0.01 per unique query. Classic count-based exploration (Bellemare et al. 2016; Never Give Up) naturally tapers.
Curriculum on step budget: Start with a generous budget (20 steps) for easy questions, tighten to 10 for hard ones as training progresses.
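The tapering bonus is a one-liner; whether the count starts at 0 or 1 for the first unique query is a design choice (0 is assumed here):

```python
# Diminishing novelty bonus: the k-th unique query earns less than the first.
def novelty_bonus(exploration_count):
    """Count-based taper; would replace the flat +0.01 new_info bonus."""
    return 0.01 / (1 + 0.5 * exploration_count)

assert novelty_bonus(0) == 0.01    # first unique query: full bonus
assert novelty_bonus(2) == 0.005   # third unique query: half bonus
```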
Per-layer independent clipping: Clip L1 and L2 separately rather than their sum, preventing one layer from consuming the other's budget.
Lambda-GRPO: Apply Sullivan et al.'s fix for non-uniform step distribution to improve credit assignment across steps.
Adaptive Length Penalty (ALP): From "Just Enough Thinking" (2026): per-prompt length penalties based on solve rate. Could adapt step budget per difficulty level.
Why Result-Based, Not SQL-Structure-Based
A natural question: why compare query results to the gold results, rather than comparing the SQL structure to the gold SQL?
The equivalence problem
Multiple SQL queries produce identical results:
SELECT name FROM employees WHERE dept_id IN (SELECT id FROM departments WHERE location = 'NYC')
SELECT e.name FROM employees e JOIN departments d ON e.dept_id = d.id WHERE d.location = 'NYC'
SELECT name FROM employees WHERE EXISTS (SELECT 1 FROM departments WHERE id = employees.dept_id AND location = 'NYC')
Rewarding structural similarity to one gold query penalizes valid alternatives. This creates false negative gradient signal that hurts training.
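The equivalence is easy to verify with Python's stdlib sqlite3 on a toy schema (invented here for illustration): result-set comparison treats structurally different formulations identically.

```python
# Two structurally different queries, identical result sets.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, location TEXT);
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    INSERT INTO departments VALUES (1, 'NYC'), (2, 'SF');
    INSERT INTO employees VALUES ('Ada', 1), ('Bob', 2), ('Cy', 1);
""")

subquery = """SELECT name FROM employees
              WHERE dept_id IN (SELECT id FROM departments WHERE location = 'NYC')"""
join = """SELECT e.name FROM employees e
          JOIN departments d ON e.dept_id = d.id WHERE d.location = 'NYC'"""

# Result-based comparison: order-insensitive multiset of rows.
a = sorted(con.execute(subquery).fetchall())
b = sorted(con.execute(join).fetchall())
assert a == b == [('Ada',), ('Cy',)]
```

A structural scorer would rate these two queries as dissimilar; the execution-based scorer correctly gives them identical progress.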
The field moved away from structural comparison
Spider (Yu et al., 2018) used exact set match (decompose SQL into component sets). BIRD (Li et al., 2023) replaced it with execution accuracy, explicitly arguing that "exact match is too strict and penalizes valid alternative SQL formulations." Every recent system (DAIL-SQL, MAC-SQL, CHESS) evaluates on execution accuracy.
Intermediate queries aren't meant to look like the gold
In our POMDP, the agent runs exploratory queries (SELECT * FROM t LIMIT 5, SELECT COUNT(*)) to gather information. These should look nothing like the gold query. Rewarding structural similarity would push the agent toward exploitation before it has explored enough.
Result comparison is the right signal
| Dimension | Result-based | SQL-structure-based |
|---|---|---|
| Handles SQL equivalence | Yes | No |
| Correlates with true objective | Directly | Indirectly (proxy) |
| Works for exploratory queries | Yes | No (penalizes exploration) |
| Literature support | Strong (BIRD, CodeRL, LEVER) | Declining (Spider exact match being replaced) |
What about SQL validity rewards?
One structural signal IS worth using: penalizing queries that fail to execute (syntax errors, missing tables/columns). This is not SQL similarity — it's SQL validity. We already do this via L1 operational rewards: exec_ok (+0.02) vs error (-0.005 step cost only). This accelerates learning without biasing toward a specific solution path.
References
- Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
- Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
- Shao et al. (2024). DeepSeek-Math: GRPO.
- DeepSeek-AI (2025). DeepSeek-R1.
- TIPS (2026). Turn-Level Information-Potential Reward Shaping.
- ToolRL (2025). Reward is All Tool Learning Needs.
- StepTool (2024). Step-grained RL for Tool Learning.
- RAGEN (2025). Multi-Turn RL for LLM Agents.
- Bellemare et al. (2016). Unifying Count-Based Exploration and Intrinsic Motivation.
- Just Enough Thinking (2026). Adaptive Length Penalties.
- Fireworks AI. Best Practices for Multi-Turn RL.
- Wiewiora, Cottrell, Elkan (2003). Principled Methods for Advising Reinforcement Learning Agents.
- Yu et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing.
- Li et al. (2023). Can LLM Already Serve as a Database Interface? (BIRD).
- Zhong et al. (2020). Semantic Evaluation for Text-to-SQL with Distilled Test Suites.
- Le et al. (2022). CodeRL: Mastering Code Generation through Pretrained Models and Deep RL.
- Lightman et al. (2023). Let's Verify Step by Step.
- Ni et al. (2023). LEVER: Learning to Verify Language-to-Code Generation with Execution.