Spaces:
Build error
Reward System Implementation Guide
This document shows how the reward system is implemented in code and how to use it.
Module Documentation
The reward system architecture is documented at the module level:
import server.env
print(server.env.__doc__)
Output shows all 6 reward components and the final computation formula.
Reward Constants
All reward constants are defined in server/env.py (lines 57-87):
# Component 1: Score improvement reward
PROGRESS_SCALE = 0.25
# Component 2: Syntax/compilation fix reward
SYNTAX_FIX_BONUS = 0.35
# Component 3: Test improvement reward
TEST_PASS_REWARD_SCALE = 0.30
# Component 4: Code quality reward
QUALITY_BONUS_SCALE = 0.15
# Component 5: Stagnation penalty
STAGNATION_PENALTY = 0.10
# Component 6: Regression penalty
REGRESSION_PENALTY_SCALE = 0.20
# One-time completion bonus
COMPLETION_BONUS = 0.50
# Invalid/error penalties
INVALID_ACTION_PENALTY = 0.15
TIMEOUT_PENALTY = 0.15
To tune the reward system, edit these constants and re-test.
RewardDetails Model Documentation
Located in models.py (lines 26-80):
from models import RewardDetails
print(RewardDetails.__doc__)
Shows all 15 fields and their meanings:
value: Final scalar reward [-1.0, +1.0]progress_delta: Score improvement componentsyntax_reward: Syntax fix bonustest_reward: Test improvement bonusquality_bonus: Code quality improvementstagnation_penalty: Unchanged code penaltyregression_penalty: Score decline penaltyreason: Human-readable explanationprev_score,curr_score: Score before/aftercode_changed: Whether code was modified
Core Computation Method
The main reward computation is in _compute_reward_components() (server/env.py, lines 507-703):
def _compute_reward_components(
self,
curr_score: float,
prev_score: float,
curr_grade: TaskGrade,
code_changed: bool,
prev_grade_score: float = 0.0,
) -> dict:
"""Compute all six reward components and return combined result."""
What It Does
- Initializes empty component dict
- Computes each component:
- Progress: Score improvement scaled by PROGRESS_SCALE
- Syntax: One-time bonus if first compile
- Test: Test pass rate improvement scaled by TEST_PASS_REWARD_SCALE
- Quality: Code quality improvement scaled by QUALITY_BONUS_SCALE
- Stagnation: Penalty if code unchanged
- Regression: Penalty if score decreased
- Combines: Sums positives, subtracts negatives
- Clamps: Bounds result to [-1.0, +1.0]
Key Design Decisions
- Monotonic tracking: Best test rate and quality in episode are tracked
- One-time bonuses: Syntax reward awarded once per episode
- Scale capping: Each component has a maximum (e.g., progress max +0.25)
- Timeout handling: Special penalty instead of score-based
- Clamping: Final reward bounded for numerical stability
Debug Logging
When verbose=True, the environment prints detailed debug output via _log_debug_step():
env = PythonCodeReviewEnvironment(verbose=True)
obs = env.reset()
obs = env.step(action)
Output format:
Step 1 | Score: 0.698 | Delta: +0.698 | Reward: +0.4239 | Changed: False
| Progress=+0.174 | Quality=+0.149 | Stagnation=+0.100
| Reason: Syntax error detected: '(' was never closed
Shows:
- Step number
- Current score and delta from previous
- Final reward value
- Whether code changed
- Non-zero components only
- Human-readable reason
Example: Full Episode with Rewards
from server.env import PythonCodeReviewEnvironment
from models import PythonCodeReviewAction
env = PythonCodeReviewEnvironment(verbose=True)
obs = env.reset(task_id='syntax-fix-easy')
# Step 1: Analyze (no code change)
action = PythonCodeReviewAction(action_type='analyze_code')
obs = env.step(action)
print(f"Reward 1: {obs.reward_details.value:.4f}")
# Step 2: Edit with fix
code = 'x = 1; y = 2; print(x + y)'
action = PythonCodeReviewAction(action_type='edit_code', code=code)
obs = env.step(action)
print(f"Reward 2: {obs.reward_details.value:.4f}")
# Step 3: Submit
action = PythonCodeReviewAction(action_type='submit_solution')
obs = env.step(action)
print(f"Final Reward: {obs.reward_details.value:.4f}")
Interpreting Rewards
Positive Rewards (+0 to +1.0)
- +0.5 - +1.0: Major progress (syntax fix, many tests passing)
- +0.2 - +0.5: Good progress (score improvement, test gains)
- +0.0 - +0.2: Small progress (quality improvement, minor gains)
Negative Rewards (β1.0 to β0)
- β0.1 - 0: Stagnation (analyzed without changing code)
- β0.2 - β0.1: Slight regression (small score drop)
- β0.5 - β0.2: Major regression (significant score drop)
- β1.0 - β0.5: Invalid action or timeout
Tuning the Reward System
For Faster Early Learning
β Increase SYNTAX_FIX_BONUS and COMPLETION_BONUS
To Encourage Editing Over Analysis
β Increase STAGNATION_PENALTY
To Reward Test Improvements More
β Increase TEST_PASS_REWARD_SCALE
To Penalize Mistakes More
β Increase REGRESSION_PENALTY_SCALE
To Balance All Components
Adjust the Scale constants (all in range 0.15-0.35 for stability)
Accessing Documentation Programmatically
from server.env import PythonCodeReviewEnvironment
from models import RewardDetails
import server.env
# Module-level architecture
print(server.env.__doc__)
# RewardDetails fields
print(RewardDetails.__doc__)
# One method
env = PythonCodeReviewEnvironment()
help(env._compute_reward_components)
All major functions and classes have comprehensive docstrings.