Spaces:

uvpatel7271
/

python_env

Build error

App Files Files Community

python_env / REWARD_SYSTEM_GUIDE.md

uvpatel7271

Upload folder using huggingface_hub

c8e832f verified 10 days ago

preview code

raw

history blame contribute delete

5.92 kB

Reward System Implementation Guide

This document shows how the reward system is implemented in code and how to use it.

Module Documentation

The reward system architecture is documented at the module level:

import server.env
print(server.env.__doc__)

Output shows all 6 reward components and the final computation formula.

Reward Constants

All reward constants are defined in server/env.py (lines 57-87):

# Component 1: Score improvement reward
PROGRESS_SCALE = 0.25

# Component 2: Syntax/compilation fix reward
SYNTAX_FIX_BONUS = 0.35

# Component 3: Test improvement reward
TEST_PASS_REWARD_SCALE = 0.30

# Component 4: Code quality reward
QUALITY_BONUS_SCALE = 0.15

# Component 5: Stagnation penalty
STAGNATION_PENALTY = 0.10

# Component 6: Regression penalty
REGRESSION_PENALTY_SCALE = 0.20

# One-time completion bonus
COMPLETION_BONUS = 0.50

# Invalid/error penalties
INVALID_ACTION_PENALTY = 0.15
TIMEOUT_PENALTY = 0.15

To tune the reward system, edit these constants and re-test.

RewardDetails Model Documentation

Located in models.py (lines 26-80):

from models import RewardDetails
print(RewardDetails.__doc__)

Shows all 15 fields and their meanings:

value: Final scalar reward [-1.0, +1.0]
progress_delta: Score improvement component
syntax_reward: Syntax fix bonus
test_reward: Test improvement bonus
quality_bonus: Code quality improvement
stagnation_penalty: Unchanged code penalty
regression_penalty: Score decline penalty
reason: Human-readable explanation
prev_score, curr_score: Score before/after
code_changed: Whether code was modified

Core Computation Method

The main reward computation is in _compute_reward_components() (server/env.py, lines 507-703):

def _compute_reward_components(
    self,
    curr_score: float,
    prev_score: float,
    curr_grade: TaskGrade,
    code_changed: bool,
    prev_grade_score: float = 0.0,
) -> dict:
    """Compute all six reward components and return combined result."""

What It Does

Initializes empty component dict
Computes each component:
- Progress: Score improvement scaled by PROGRESS_SCALE
- Syntax: One-time bonus if first compile
- Test: Test pass rate improvement scaled by TEST_PASS_REWARD_SCALE
- Quality: Code quality improvement scaled by QUALITY_BONUS_SCALE
- Stagnation: Penalty if code unchanged
- Regression: Penalty if score decreased
Combines: Sums positives, subtracts negatives
Clamps: Bounds result to [-1.0, +1.0]

Key Design Decisions

Monotonic tracking: Best test rate and quality in episode are tracked
One-time bonuses: Syntax reward awarded once per episode
Scale capping: Each component has a maximum (e.g., progress max +0.25)
Timeout handling: Special penalty instead of score-based
Clamping: Final reward bounded for numerical stability

Debug Logging

When verbose=True, the environment prints detailed debug output via _log_debug_step():

env = PythonCodeReviewEnvironment(verbose=True)
obs = env.reset()
obs = env.step(action)

Output format:

Step  1 | Score: 0.698 | Delta: +0.698 | Reward: +0.4239 | Changed: False
         | Progress=+0.174 | Quality=+0.149 | Stagnation=+0.100
         | Reason: Syntax error detected: '(' was never closed

Shows:

Step number
Current score and delta from previous
Final reward value
Whether code changed
Non-zero components only
Human-readable reason

Example: Full Episode with Rewards

from server.env import PythonCodeReviewEnvironment
from models import PythonCodeReviewAction

env = PythonCodeReviewEnvironment(verbose=True)
obs = env.reset(task_id='syntax-fix-easy')

# Step 1: Analyze (no code change)
action = PythonCodeReviewAction(action_type='analyze_code')
obs = env.step(action)
print(f"Reward 1: {obs.reward_details.value:.4f}")

# Step 2: Edit with fix
code = 'x = 1; y = 2; print(x + y)'
action = PythonCodeReviewAction(action_type='edit_code', code=code)
obs = env.step(action)
print(f"Reward 2: {obs.reward_details.value:.4f}")

# Step 3: Submit
action = PythonCodeReviewAction(action_type='submit_solution')
obs = env.step(action)
print(f"Final Reward: {obs.reward_details.value:.4f}")

Interpreting Rewards

Positive Rewards (+0 to +1.0)

+0.5 - +1.0: Major progress (syntax fix, many tests passing)
+0.2 - +0.5: Good progress (score improvement, test gains)
+0.0 - +0.2: Small progress (quality improvement, minor gains)

Negative Rewards (−1.0 to −0)

−0.1 - 0: Stagnation (analyzed without changing code)
−0.2 - −0.1: Slight regression (small score drop)
−0.5 - −0.2: Major regression (significant score drop)
−1.0 - −0.5: Invalid action or timeout

Tuning the Reward System

For Faster Early Learning

↑ Increase SYNTAX_FIX_BONUS and COMPLETION_BONUS

To Encourage Editing Over Analysis

↑ Increase STAGNATION_PENALTY

To Reward Test Improvements More

↑ Increase TEST_PASS_REWARD_SCALE

To Penalize Mistakes More

↑ Increase REGRESSION_PENALTY_SCALE

To Balance All Components

Adjust the Scale constants (all in range 0.15-0.35 for stability)

Accessing Documentation Programmatically

from server.env import PythonCodeReviewEnvironment
from models import RewardDetails
import server.env

# Module-level architecture
print(server.env.__doc__)

# RewardDetails fields
print(RewardDetails.__doc__)

# One method
env = PythonCodeReviewEnvironment()
help(env._compute_reward_components)

All major functions and classes have comprehensive docstrings.