# Reward System Implementation Guide

This document shows how the reward system is implemented in code and how to use it.

## Module Documentation

The reward system architecture is documented at the module level:

```python
import server.env
print(server.env.__doc__)
```

Output shows all 6 reward components and the final computation formula.

## Reward Constants

All reward constants are defined in `server/env.py` (lines 57-87):

```python
# Component 1: Score improvement reward
PROGRESS_SCALE = 0.25

# Component 2: Syntax/compilation fix reward
SYNTAX_FIX_BONUS = 0.35

# Component 3: Test improvement reward
TEST_PASS_REWARD_SCALE = 0.30

# Component 4: Code quality reward
QUALITY_BONUS_SCALE = 0.15

# Component 5: Stagnation penalty
STAGNATION_PENALTY = 0.10

# Component 6: Regression penalty
REGRESSION_PENALTY_SCALE = 0.20

# One-time completion bonus
COMPLETION_BONUS = 0.50

# Invalid/error penalties
INVALID_ACTION_PENALTY = 0.15
TIMEOUT_PENALTY = 0.15
```

To tune the reward system, edit these constants and re-test.

## RewardDetails Model Documentation

Located in `models.py` (lines 26-80):

```python
from models import RewardDetails
print(RewardDetails.__doc__)
```

Shows all 15 fields and their meanings:
- `value`: Final scalar reward [-1.0, +1.0]
- `progress_delta`: Score improvement component
- `syntax_reward`: Syntax fix bonus
- `test_reward`: Test improvement bonus
- `quality_bonus`: Code quality improvement
- `stagnation_penalty`: Unchanged code penalty
- `regression_penalty`: Score decline penalty
- `reason`: Human-readable explanation
- `prev_score`, `curr_score`: Score before/after
- `code_changed`: Whether code was modified

## Core Computation Method

The main reward computation is in `_compute_reward_components()` (server/env.py, lines 507-703):

```python
def _compute_reward_components(
    self,
    curr_score: float,
    prev_score: float,
    curr_grade: TaskGrade,
    code_changed: bool,
    prev_grade_score: float = 0.0,
) -> dict:
    """Compute all six reward components and return combined result."""
```

### What It Does

1. **Initializes** empty component dict
2. **Computes each component**:
   - Progress: Score improvement scaled by PROGRESS_SCALE
   - Syntax: One-time bonus if first compile
   - Test: Test pass rate improvement scaled by TEST_PASS_REWARD_SCALE
   - Quality: Code quality improvement scaled by QUALITY_BONUS_SCALE
   - Stagnation: Penalty if code unchanged
   - Regression: Penalty if score decreased
3. **Combines**: Sums positives, subtracts negatives
4. **Clamps**: Bounds result to [-1.0, +1.0]

### Key Design Decisions

- **Monotonic tracking**: Best test rate and quality in episode are tracked
- **One-time bonuses**: Syntax reward awarded once per episode
- **Scale capping**: Each component has a maximum (e.g., progress max +0.25)
- **Timeout handling**: Special penalty instead of score-based
- **Clamping**: Final reward bounded for numerical stability

## Debug Logging

When `verbose=True`, the environment prints detailed debug output via `_log_debug_step()`:

```python
env = PythonCodeReviewEnvironment(verbose=True)
obs = env.reset()
obs = env.step(action)
```

Output format:
```
Step  1 | Score: 0.698 | Delta: +0.698 | Reward: +0.4239 | Changed: False
         | Progress=+0.174 | Quality=+0.149 | Stagnation=+0.100
         | Reason: Syntax error detected: '(' was never closed
```

Shows:
- Step number
- Current score and delta from previous
- Final reward value
- Whether code changed
- Non-zero components only
- Human-readable reason

## Example: Full Episode with Rewards

```python
from server.env import PythonCodeReviewEnvironment
from models import PythonCodeReviewAction

env = PythonCodeReviewEnvironment(verbose=True)
obs = env.reset(task_id='syntax-fix-easy')

# Step 1: Analyze (no code change)
action = PythonCodeReviewAction(action_type='analyze_code')
obs = env.step(action)
print(f"Reward 1: {obs.reward_details.value:.4f}")

# Step 2: Edit with fix
code = 'x = 1; y = 2; print(x + y)'
action = PythonCodeReviewAction(action_type='edit_code', code=code)
obs = env.step(action)
print(f"Reward 2: {obs.reward_details.value:.4f}")

# Step 3: Submit
action = PythonCodeReviewAction(action_type='submit_solution')
obs = env.step(action)
print(f"Final Reward: {obs.reward_details.value:.4f}")
```

## Interpreting Rewards

### Positive Rewards (+0 to +1.0)
- **+0.5 - +1.0**: Major progress (syntax fix, many tests passing)
- **+0.2 - +0.5**: Good progress (score improvement, test gains)
- **+0.0 - +0.2**: Small progress (quality improvement, minor gains)

### Negative Rewards (−1.0 to −0)
- **−0.1 - 0**: Stagnation (analyzed without changing code)
- **−0.2 - −0.1**: Slight regression (small score drop)
- **−0.5 - −0.2**: Major regression (significant score drop)
- **−1.0 - −0.5**: Invalid action or timeout

## Tuning the Reward System

### For Faster Early Learning
↑ Increase `SYNTAX_FIX_BONUS` and `COMPLETION_BONUS`

### To Encourage Editing Over Analysis
↑ Increase `STAGNATION_PENALTY`

### To Reward Test Improvements More
↑ Increase `TEST_PASS_REWARD_SCALE`

### To Penalize Mistakes More
↑ Increase `REGRESSION_PENALTY_SCALE`

### To Balance All Components
Adjust the Scale constants (all in range 0.15-0.35 for stability)

## Accessing Documentation Programmatically

```python
from server.env import PythonCodeReviewEnvironment
from models import RewardDetails
import server.env

# Module-level architecture
print(server.env.__doc__)

# RewardDetails fields
print(RewardDetails.__doc__)

# One method
env = PythonCodeReviewEnvironment()
help(env._compute_reward_components)
```

All major functions and classes have comprehensive docstrings.