# Reward System Implementation Guide This document shows how the reward system is implemented in code and how to use it. ## Module Documentation The reward system architecture is documented at the module level: ```python import server.env print(server.env.__doc__) ``` Output shows all 6 reward components and the final computation formula. ## Reward Constants All reward constants are defined in `server/env.py` (lines 57-87): ```python # Component 1: Score improvement reward PROGRESS_SCALE = 0.25 # Component 2: Syntax/compilation fix reward SYNTAX_FIX_BONUS = 0.35 # Component 3: Test improvement reward TEST_PASS_REWARD_SCALE = 0.30 # Component 4: Code quality reward QUALITY_BONUS_SCALE = 0.15 # Component 5: Stagnation penalty STAGNATION_PENALTY = 0.10 # Component 6: Regression penalty REGRESSION_PENALTY_SCALE = 0.20 # One-time completion bonus COMPLETION_BONUS = 0.50 # Invalid/error penalties INVALID_ACTION_PENALTY = 0.15 TIMEOUT_PENALTY = 0.15 ``` To tune the reward system, edit these constants and re-test. ## RewardDetails Model Documentation Located in `models.py` (lines 26-80): ```python from models import RewardDetails print(RewardDetails.__doc__) ``` Shows all 15 fields and their meanings: - `value`: Final scalar reward [-1.0, +1.0] - `progress_delta`: Score improvement component - `syntax_reward`: Syntax fix bonus - `test_reward`: Test improvement bonus - `quality_bonus`: Code quality improvement - `stagnation_penalty`: Unchanged code penalty - `regression_penalty`: Score decline penalty - `reason`: Human-readable explanation - `prev_score`, `curr_score`: Score before/after - `code_changed`: Whether code was modified ## Core Computation Method The main reward computation is in `_compute_reward_components()` (server/env.py, lines 507-703): ```python def _compute_reward_components( self, curr_score: float, prev_score: float, curr_grade: TaskGrade, code_changed: bool, prev_grade_score: float = 0.0, ) -> dict: """Compute all six reward components and return combined result.""" ``` ### What It Does 1. **Initializes** empty component dict 2. **Computes each component**: - Progress: Score improvement scaled by PROGRESS_SCALE - Syntax: One-time bonus if first compile - Test: Test pass rate improvement scaled by TEST_PASS_REWARD_SCALE - Quality: Code quality improvement scaled by QUALITY_BONUS_SCALE - Stagnation: Penalty if code unchanged - Regression: Penalty if score decreased 3. **Combines**: Sums positives, subtracts negatives 4. **Clamps**: Bounds result to [-1.0, +1.0] ### Key Design Decisions - **Monotonic tracking**: Best test rate and quality in episode are tracked - **One-time bonuses**: Syntax reward awarded once per episode - **Scale capping**: Each component has a maximum (e.g., progress max +0.25) - **Timeout handling**: Special penalty instead of score-based - **Clamping**: Final reward bounded for numerical stability ## Debug Logging When `verbose=True`, the environment prints detailed debug output via `_log_debug_step()`: ```python env = PythonCodeReviewEnvironment(verbose=True) obs = env.reset() obs = env.step(action) ``` Output format: ``` Step 1 | Score: 0.698 | Delta: +0.698 | Reward: +0.4239 | Changed: False | Progress=+0.174 | Quality=+0.149 | Stagnation=+0.100 | Reason: Syntax error detected: '(' was never closed ``` Shows: - Step number - Current score and delta from previous - Final reward value - Whether code changed - Non-zero components only - Human-readable reason ## Example: Full Episode with Rewards ```python from server.env import PythonCodeReviewEnvironment from models import PythonCodeReviewAction env = PythonCodeReviewEnvironment(verbose=True) obs = env.reset(task_id='syntax-fix-easy') # Step 1: Analyze (no code change) action = PythonCodeReviewAction(action_type='analyze_code') obs = env.step(action) print(f"Reward 1: {obs.reward_details.value:.4f}") # Step 2: Edit with fix code = 'x = 1; y = 2; print(x + y)' action = PythonCodeReviewAction(action_type='edit_code', code=code) obs = env.step(action) print(f"Reward 2: {obs.reward_details.value:.4f}") # Step 3: Submit action = PythonCodeReviewAction(action_type='submit_solution') obs = env.step(action) print(f"Final Reward: {obs.reward_details.value:.4f}") ``` ## Interpreting Rewards ### Positive Rewards (+0 to +1.0) - **+0.5 - +1.0**: Major progress (syntax fix, many tests passing) - **+0.2 - +0.5**: Good progress (score improvement, test gains) - **+0.0 - +0.2**: Small progress (quality improvement, minor gains) ### Negative Rewards (−1.0 to −0) - **−0.1 - 0**: Stagnation (analyzed without changing code) - **−0.2 - −0.1**: Slight regression (small score drop) - **−0.5 - −0.2**: Major regression (significant score drop) - **−1.0 - −0.5**: Invalid action or timeout ## Tuning the Reward System ### For Faster Early Learning ↑ Increase `SYNTAX_FIX_BONUS` and `COMPLETION_BONUS` ### To Encourage Editing Over Analysis ↑ Increase `STAGNATION_PENALTY` ### To Reward Test Improvements More ↑ Increase `TEST_PASS_REWARD_SCALE` ### To Penalize Mistakes More ↑ Increase `REGRESSION_PENALTY_SCALE` ### To Balance All Components Adjust the Scale constants (all in range 0.15-0.35 for stability) ## Accessing Documentation Programmatically ```python from server.env import PythonCodeReviewEnvironment from models import RewardDetails import server.env # Module-level architecture print(server.env.__doc__) # RewardDetails fields print(RewardDetails.__doc__) # One method env = PythonCodeReviewEnvironment() help(env._compute_reward_components) ``` All major functions and classes have comprehensive docstrings.