uvpatel7271 committed on
Commit
9fa2f22
·
verified ·
1 Parent(s): 6266f5f

Upload folder using huggingface_hub

Browse files
Files changed (4)
  1. Project.md +94 -0
  2. REWARD_SYSTEM_GUIDE.md +206 -0
  3. models.py +52 -15
  4. server/env.py +233 -24
Project.md CHANGED
@@ -153,6 +153,100 @@ It uses AST inspection to detect:
153
 
154
  This is not the task grader. It is the direct-review helper.
155

156
+ ### Reward System
157
+
158
+ The reward system is **dynamic and multi-component**, designed to provide meaningful feedback at every step of the agent's learning process.
159
+
160
+ #### Reward Architecture
161
+
162
+ The system computes rewards using **6 independent components**:
163
+
164
+ 1. **Progress Reward** (max +0.25)
165
+ - Awarded when the agent improves the score from one step to the next
166
+ - Formula: `min(PROGRESS_SCALE * score_delta, 0.25)`
167
+ - Encourages continuous improvement
168
+
169
+ 2. **Syntax Reward** (max +0.35)
170
+ - One-time bonus awarded for fixing syntax errors (first time compiling)
171
+ - Applied once per episode when code transitions from uncompilable to compilable
172
+ - Acknowledges the critical first step of making code valid
173
+
174
+ 3. **Test Reward** (max +0.20)
175
+ - Based on improvement in test pass rate
176
+ - Computed as: `min(TEST_PASS_REWARD_SCALE * test_improvement_fraction, 0.20)`
177
+ - Rewards incremental progress on passing more tests
178
+
179
+ 4. **Quality Reward** (max +0.15)
180
+ - Based on AST-detected code quality metrics
181
+ - Rewards improvements in code structure, readability, and best practices
182
+ - Uses deterministic grader feedback
183
+
184
+ 5. **Stagnation Penalty** (−0.10)
185
+ - Applied when the agent takes action but code doesn't change
186
+ - Encourages the agent to edit the code rather than analyze repeatedly
187
+ - Configurable via `STAGNATION_PENALTY` constant
188
+
189
+ 6. **Regression Penalty** (scale −0.20)
190
+ - Applied when score decreases from previous step
191
+ - Formula: `REGRESSION_PENALTY_SCALE * abs(score_delta)`
192
+ - Discourages actions that make code worse
193
+
194
+ #### Reward Constants
195
+
196
+ Defined at the top of `server/env.py`:
197
+
198
+ ```python
199
+ SYNTAX_FIX_BONUS = 0.35 # One-time syntax reward
200
+ TEST_PASS_REWARD_SCALE = 0.30 # Per test improvement
201
+ QUALITY_BONUS_SCALE = 0.15 # Code quality improvement
202
+ PROGRESS_SCALE = 0.25 # Score improvement
203
+ COMPLETION_BONUS = 0.50 # Full correctness bonus
204
+ INVALID_ACTION_PENALTY = 0.15 # For unsupported actions
205
+ STAGNATION_PENALTY = 0.10 # For unchanged code
206
+ REGRESSION_PENALTY_SCALE = 0.20 # For score decline
207
+ TIMEOUT_PENALTY = 0.15 # For execution timeout
208
+ ```
209
+
210
+ #### Final Reward Computation
211
+
212
+ The final reward is:
213
+
214
+ ```
215
+ total = progress + syntax + test + quality - stagnation - regression
216
+ final_reward = clamp(total, -1.0, +1.0)
217
+ ```
218
+
219
+ The result is always between −1.0 and +1.0, providing bounded, interpretable feedback.
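The combination above can be sketched as a standalone helper (an illustrative sketch only; the actual implementation lives in `_compute_reward_components` in `server/env.py`):

```python
# Illustrative sketch of the final reward combination (not the repo's code).
def combine_reward(progress: float, syntax: float, test: float,
                   quality: float, stagnation: float, regression: float) -> float:
    """Sum positive components, subtract penalty magnitudes, clamp to [-1.0, +1.0]."""
    total = progress + syntax + test + quality - stagnation - regression
    return max(-1.0, min(1.0, round(total, 6)))

# Example: progress plus the one-time syntax bonus, no penalties.
reward = combine_reward(0.25, 0.35, 0.0, 0.0, 0.0, 0.0)
```

Because every component is individually capped, the clamp only engages when several components fire on the same step.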
220
+
221
+ #### RewardDetails: Transparent Feedback
222
+
223
+ Every reward is returned as a `RewardDetails` object with these fields:
224
+
225
+ - `value`: The scalar reward for this step
226
+ - `syntax_reward`: Contribution from syntax fixes
227
+ - `test_reward`: Contribution from test improvements
228
+ - `quality_bonus`: Contribution from code quality
229
+ - `progress_delta`: Contribution from score improvement
230
+ - `stagnation_penalty`: Penalty for unchanged code
231
+ - `regression_penalty`: Penalty for score decline
232
+ - `prev_score` / `curr_score`: Score before and after the action
233
+ - `code_changed`: Whether the action modified the code
234
+ - `reason`: Human-readable explanation of the reward
235
+
236
+ This transparency is crucial for:
237
+ - Debugging agent behavior
238
+ - Understanding what drives reward
239
+ - Tuning the constants
240
+ - Training supervised models on reward components
241
+
242
+ #### Why This Design Helps Agents Learn
243
+
244
+ 1. **Non-Constant**: Different actions produce different rewards, enabling meaningful gradient signals
245
+ 2. **Progressive**: Early bonuses (syntax) are high; later improvements are smaller, promoting efficiency
246
+ 3. **Transparent**: Detailed component breakdown helps agents understand what matters
247
+ 4. **Bounded**: Clamping to [−1, 1] prevents reward hacking and explosion
248
+ 5. **Balanced**: Positive and negative signals together teach the agent what to pursue and what to avoid
249
+
250
  ### `server/code_review_environment.py`
251
 
252
  This is the environment core.
REWARD_SYSTEM_GUIDE.md ADDED
@@ -0,0 +1,206 @@
1
+ # Reward System Implementation Guide
2
+
3
+ This document shows how the reward system is implemented in code and how to use it.
4
+
5
+ ## Module Documentation
6
+
7
+ The reward system architecture is documented at the module level:
8
+
9
+ ```python
10
+ import server.env
11
+ print(server.env.__doc__)
12
+ ```
13
+
14
+ Output shows all 6 reward components and the final computation formula.
15
+
16
+ ## Reward Constants
17
+
18
+ All reward constants are defined in `server/env.py` (lines 57-87):
19
+
20
+ ```python
21
+ # Component 1: Score improvement reward
22
+ PROGRESS_SCALE = 0.25
23
+
24
+ # Component 2: Syntax/compilation fix reward
25
+ SYNTAX_FIX_BONUS = 0.35
26
+
27
+ # Component 3: Test improvement reward
28
+ TEST_PASS_REWARD_SCALE = 0.30
29
+
30
+ # Component 4: Code quality reward
31
+ QUALITY_BONUS_SCALE = 0.15
32
+
33
+ # Component 5: Stagnation penalty
34
+ STAGNATION_PENALTY = 0.10
35
+
36
+ # Component 6: Regression penalty
37
+ REGRESSION_PENALTY_SCALE = 0.20
38
+
39
+ # One-time completion bonus
40
+ COMPLETION_BONUS = 0.50
41
+
42
+ # Invalid/error penalties
43
+ INVALID_ACTION_PENALTY = 0.15
44
+ TIMEOUT_PENALTY = 0.15
45
+ ```
46
+
47
+ To tune the reward system, edit these constants and re-test.
48
+
49
+ ## RewardDetails Model Documentation
50
+
51
+ Located in `models.py` (lines 26-80):
52
+
53
+ ```python
54
+ from models import RewardDetails
55
+ print(RewardDetails.__doc__)
56
+ ```
57
+
58
+ Shows all 14 fields and their meanings:
59
+ - `value`: Final scalar reward [-1.0, +1.0]
60
+ - `progress_delta`: Score improvement component
61
+ - `syntax_reward`: Syntax fix bonus
62
+ - `test_reward`: Test improvement bonus
63
+ - `quality_bonus`: Code quality improvement
64
+ - `stagnation_penalty`: Unchanged code penalty
65
+ - `regression_penalty`: Score decline penalty
66
+ - `reason`: Human-readable explanation
67
+ - `prev_score`, `curr_score`: Score before/after
68
+ - `code_changed`: Whether code was modified
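The field layout can be mimicked with a plain dataclass for offline experimentation (an illustrative stand-in; the real `RewardDetails` is a Pydantic `BaseModel` in `models.py`):

```python
from dataclasses import dataclass

@dataclass
class RewardDetailsSketch:
    # Stand-in mirroring the documented fields of models.RewardDetails.
    value: float
    reason: str
    syntax_reward: float = 0.0
    test_reward: float = 0.0
    quality_bonus: float = 0.0
    progress_delta: float = 0.0
    stagnation_penalty: float = 0.0
    regression_penalty: float = 0.0
    prev_score: float = 0.0
    curr_score: float = 0.0
    code_changed: bool = False

rd = RewardDetailsSketch(value=0.6, reason="syntax fixed",
                         syntax_reward=0.35, progress_delta=0.25)
```

Only `value` and `reason` are required; every component field defaults to zero, so a reward breakdown always has all keys present.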
69
+
70
+ ## Core Computation Method
71
+
72
+ The main reward computation is in `_compute_reward_components()` (server/env.py, lines 507-703):
73
+
74
+ ```python
75
+ def _compute_reward_components(
76
+ self,
77
+ curr_score: float,
78
+ prev_score: float,
79
+ curr_grade: TaskGrade,
80
+ code_changed: bool,
81
+ prev_grade_score: float = 0.0,
82
+ ) -> dict:
83
+ """Compute all six reward components and return combined result."""
84
+ ```
85
+
86
+ ### What It Does
87
+
88
+ 1. **Initializes** empty component dict
89
+ 2. **Computes each component**:
90
+ - Progress: Score improvement scaled by PROGRESS_SCALE
91
+ - Syntax: One-time bonus if first compile
92
+ - Test: Test pass rate improvement scaled by TEST_PASS_REWARD_SCALE
93
+ - Quality: Code quality improvement scaled by QUALITY_BONUS_SCALE
94
+ - Stagnation: Penalty if code unchanged
95
+ - Regression: Penalty if score decreased
96
+ 3. **Combines**: Sums positives, subtracts negatives
97
+ 4. **Clamps**: Bounds result to [-1.0, +1.0]
98
+
99
+ ### Key Design Decisions
100
+
101
+ - **Monotonic tracking**: Best test rate and quality in episode are tracked
102
+ - **One-time bonuses**: Syntax reward awarded once per episode
103
+ - **Scale capping**: Each component has a maximum (e.g., progress max +0.25)
104
+ - **Timeout handling**: Special penalty instead of score-based
105
+ - **Clamping**: Final reward bounded for numerical stability
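The monotonic best-so-far bookkeeping can be illustrated in isolation (a sketch of the idea, not the environment's actual attributes):

```python
class BestSoFarTracker:
    """Track the best value seen this episode; only improvements over it count."""

    def __init__(self) -> None:
        self.best = 0.0

    def improvement(self, current: float) -> float:
        # Return only the gain over the episode's best, then update the best.
        delta = max(0.0, current - self.best)
        self.best = max(self.best, current)
        return delta

tracker = BestSoFarTracker()
gains = [tracker.improvement(r) for r in (0.25, 0.5, 0.25, 0.75)]
```

Because the tracker never decreases, oscillating back to a previously reached test rate yields zero reward, which prevents farming the same improvement twice.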
106
+
107
+ ## Debug Logging
108
+
109
+ When `verbose=True`, the environment prints detailed debug output via `_log_debug_step()`:
110
+
111
+ ```python
112
+ env = PythonCodeReviewEnvironment(verbose=True)
113
+ obs = env.reset()
114
+ obs = env.step(action)
115
+ ```
116
+
117
+ Output format:
118
+ ```
119
+ Step 1 | Score: 0.698 | Delta: +0.698 | Reward: +0.4239 | Changed: False
120
+ | Progress=+0.174 | Quality=+0.149 | Stagnation=+0.100
121
+ | Reason: Syntax error detected: '(' was never closed
122
+ ```
123
+
124
+ Shows:
125
+ - Step number
126
+ - Current score and delta from previous
127
+ - Final reward value
128
+ - Whether code changed
129
+ - Non-zero components only
130
+ - Human-readable reason
131
+
132
+ ## Example: Full Episode with Rewards
133
+
134
+ ```python
135
+ from server.env import PythonCodeReviewEnvironment
136
+ from models import PythonCodeReviewAction
137
+
138
+ env = PythonCodeReviewEnvironment(verbose=True)
139
+ obs = env.reset(task_id='syntax-fix-easy')
140
+
141
+ # Step 1: Analyze (no code change)
142
+ action = PythonCodeReviewAction(action_type='analyze_code')
143
+ obs = env.step(action)
144
+ print(f"Reward 1: {obs.reward_details.value:.4f}")
145
+
146
+ # Step 2: Edit with fix
147
+ code = 'x = 1; y = 2; print(x + y)'
148
+ action = PythonCodeReviewAction(action_type='edit_code', code=code)
149
+ obs = env.step(action)
150
+ print(f"Reward 2: {obs.reward_details.value:.4f}")
151
+
152
+ # Step 3: Submit
153
+ action = PythonCodeReviewAction(action_type='submit_solution')
154
+ obs = env.step(action)
155
+ print(f"Final Reward: {obs.reward_details.value:.4f}")
156
+ ```
157
+
158
+ ## Interpreting Rewards
159
+
160
+ ### Positive Rewards (0 to +1.0)
161
+ - **+0.5 to +1.0**: Major progress (syntax fix, many tests passing)
162
+ - **+0.2 to +0.5**: Good progress (score improvement, test gains)
163
+ - **+0.0 to +0.2**: Small progress (quality improvement, minor gains)
164
+
165
+ ### Negative Rewards (−1.0 to 0)
166
+ - **−0.1 to 0**: Stagnation (analyzed without changing code)
167
+ - **−0.2 to −0.1**: Slight regression (small score drop)
168
+ - **−0.5 to −0.2**: Major regression (significant score drop)
169
+ - **−1.0 to −0.5**: Invalid action or timeout
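These bands can be folded into a small triage helper (illustrative thresholds taken directly from the lists above):

```python
def classify_reward(r: float) -> str:
    # Map a step reward to the qualitative bands described above.
    if r >= 0.5:
        return "major progress"
    if r >= 0.2:
        return "good progress"
    if r >= 0.0:
        return "small progress"
    if r >= -0.1:
        return "stagnation"
    if r >= -0.2:
        return "slight regression"
    if r >= -0.5:
        return "major regression"
    return "invalid action or timeout"
```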
170
+
171
+ ## Tuning the Reward System
172
+
173
+ ### For Faster Early Learning
174
+ ↑ Increase `SYNTAX_FIX_BONUS` and `COMPLETION_BONUS`
175
+
176
+ ### To Encourage Editing Over Analysis
177
+ ↑ Increase `STAGNATION_PENALTY`
178
+
179
+ ### To Reward Test Improvements More
180
+ ↑ Increase `TEST_PASS_REWARD_SCALE`
181
+
182
+ ### To Penalize Mistakes More
183
+ ↑ Increase `REGRESSION_PENALTY_SCALE`
184
+
185
+ ### To Balance All Components
186
+ Adjust the scale constants (`PROGRESS_SCALE`, `TEST_PASS_REWARD_SCALE`, `QUALITY_BONUS_SCALE`, `REGRESSION_PENALTY_SCALE`); keeping them in the 0.15–0.35 range preserves stability
187
+
188
+ ## Accessing Documentation Programmatically
189
+
190
+ ```python
191
+ from server.env import PythonCodeReviewEnvironment
192
+ from models import RewardDetails
193
+ import server.env
194
+
195
+ # Module-level architecture
196
+ print(server.env.__doc__)
197
+
198
+ # RewardDetails fields
199
+ print(RewardDetails.__doc__)
200
+
201
+ # One method
202
+ env = PythonCodeReviewEnvironment()
203
+ help(env._compute_reward_components)
204
+ ```
205
+
206
+ All major functions and classes have comprehensive docstrings.
models.py CHANGED
@@ -24,24 +24,61 @@ class HistoryEntry(BaseModel):
24
 
25
 
26
  class RewardDetails(BaseModel):
27
- """Detailed reward breakdown for transparency."""
28
-
29
- value: float = Field(..., description="Net scalar reward for this step")
30
- syntax_reward: float = Field(default=0.0, description="Bonus for fixing syntax")
31
- test_reward: float = Field(default=0.0, description="Reward from passing tests")
32
- quality_bonus: float = Field(default=0.0, description="Bonus for code quality improvements")
33
- correctness_bonus: float = Field(default=0.0, description="Bonus for full correctness")
34
- progress_delta: float = Field(default=0.0, description="Reward from score improvement")
35
- stagnation_penalty: float = Field(default=0.0, description="Penalty for code not changing")
36
- regression_penalty: float = Field(default=0.0, description="Penalty for score decline")
37
- invalid_action_penalty: float = Field(default=0.0, description="Penalty for invalid actions")
38
- timeout_penalty: float = Field(default=0.0, description="Penalty for timeouts")
39
- reason: str = Field(..., description="Explanation of reward")
40
 
41
- # Debug info
42
  prev_score: float = Field(default=0.0, description="Score before this step")
43
  curr_score: float = Field(default=0.0, description="Score after this step")
44
- code_changed: bool = Field(default=False, description="Whether code was modified")
45
 
46
 
47
  class PythonCodeReviewAction(Action):
 
24
 
25
 
26
  class RewardDetails(BaseModel):
27
+ """Detailed reward breakdown for transparent agent feedback.
 
 
 
 
 
 
 
 
 
 
 
 
28
 
29
+ The reward system is dynamic and multi-component, with 6 independent sources:
30
+
31
+ 1. Progress Reward (max +0.25)
32
+ - Awarded for score improvement from previous step
33
+ - Formula: min(PROGRESS_SCALE * score_delta, 0.25)
34
+ - Encourages continuous improvement
35
+
36
+ 2. Syntax Reward (max +0.35)
37
+ - One-time bonus for fixing syntax errors (first compile)
38
+ - Applied when code transitions from uncompilable to compilable
39
+ - Acknowledges the critical first step of valid code
40
+
41
+ 3. Test Reward (max +0.20)
42
+ - Based on improvement in test pass rate
43
+ - Formula: min(TEST_PASS_REWARD_SCALE * test_improvement, 0.20)
44
+ - Rewards incremental test progress
45
+
46
+ 4. Quality Reward (max +0.15)
47
+ - Based on AST-detected code quality metrics
48
+ - Rewards improvements in structure, readability, best practices
49
+ - Uses deterministic grader feedback
50
+
51
+ 5. Stagnation Penalty (−0.10)
52
+ - Applied when agent acts but code doesn't change
53
+ - Encourages editing rather than repeated analysis
54
+ - Configurable via STAGNATION_PENALTY constant
55
+
56
+ 6. Regression Penalty (scale −0.20)
57
+ - Applied when score decreases from previous step
58
+ - Formula: REGRESSION_PENALTY_SCALE * abs(score_delta)
59
+ - Discourages actions that make code worse
60
+
61
+ Final Reward: clamp(progress + syntax + test + quality - stagnation - regression, -1.0, +1.0)
62
+
63
+ The result is always bounded in [-1.0, +1.0], providing interpretable feedback for learning.
64
+ """
65
+
66
+ value: float = Field(..., description="Net scalar reward for this step (bounded in [-1.0, +1.0])")
67
+ syntax_reward: float = Field(default=0.0, description="Bonus for fixing syntax errors (max +0.35)")
68
+ test_reward: float = Field(default=0.0, description="Reward from test improvements (max +0.20)")
69
+ quality_bonus: float = Field(default=0.0, description="Bonus for code quality improvements (max +0.15)")
70
+ correctness_bonus: float = Field(default=0.0, description="Bonus for full correctness (max +0.50)")
71
+ progress_delta: float = Field(default=0.0, description="Reward from score improvement (max +0.25)")
72
+ stagnation_penalty: float = Field(default=0.0, description="Penalty for unchanged code (−0.10)")
73
+ regression_penalty: float = Field(default=0.0, description="Penalty for score decline (scale −0.20)")
74
+ invalid_action_penalty: float = Field(default=0.0, description="Penalty for invalid actions (−0.15)")
75
+ timeout_penalty: float = Field(default=0.0, description="Penalty for execution timeout (−0.15)")
76
+ reason: str = Field(..., description="Human-readable explanation of the reward")
77
+
78
+ # Debug information for transparency
79
  prev_score: float = Field(default=0.0, description="Score before this step")
80
  curr_score: float = Field(default=0.0, description="Score after this step")
81
+ code_changed: bool = Field(default=False, description="Whether the action modified the code")
82
 
83
 
84
  class PythonCodeReviewAction(Action):
server/env.py CHANGED
@@ -1,4 +1,44 @@
1
- """Core OpenEnv environment for Python code review and repair tasks."""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2
 
3
  from __future__ import annotations
4
 
@@ -21,22 +61,55 @@ from models import (
21
  from tasks import TaskSpec, get_task, list_task_descriptors, list_task_summaries, task_ids
22
 
23
 
24
- # Reward shaping constants (balanced for meaningful variation)
25
- SYNTAX_FIX_BONUS = 0.35 # One-time for fixing syntax
26
- TEST_PASS_REWARD_SCALE = 0.30 # Per test improvement
27
- QUALITY_BONUS_SCALE = 0.15 # Code quality improvement
28
- PROGRESS_SCALE = 0.25 # Score improvement
29
- COMPLETION_BONUS = 0.50 # Full correctness (one-time)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
30
  INVALID_ACTION_PENALTY = 0.15
31
- STAGNATION_PENALTY = 0.10 # If code unchanged but action taken
32
- REGRESSION_PENALTY_SCALE = 0.20 # Per 0.1 score decline
33
  TIMEOUT_PENALTY = 0.15
 
34
 
35
 
36
  class PythonCodeReviewEnvironment(
37
  Environment[PythonCodeReviewAction, PythonCodeReviewObservation, PythonCodeReviewState]
38
  ):
39
- """Production-style environment for reviewing and fixing Python code."""
 
 
 
40
 
41
  SUPPORTS_CONCURRENT_SESSIONS = True
42
 
@@ -433,7 +506,67 @@ class PythonCodeReviewEnvironment(
433
  code_changed: bool,
434
  prev_grade_score: float = 0.0,
435
  ) -> dict:
436
- """Compute all reward components and return as dict."""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
437
  components = {
438
  "progress": 0.0,
439
  "syntax": 0.0,
@@ -444,44 +577,83 @@ class PythonCodeReviewEnvironment(
444
  "total": 0.0,
445
  }
446
 
447
- # 1. Progress reward: score improvement
448
  score_delta = curr_score - prev_score
449
  if score_delta > 0:
 
450
  components["progress"] = min(PROGRESS_SCALE * score_delta, 0.25)
451
 
452
- # 2. Syntax reward: one-time bonus for first time compiling
453
  if not self._syntax_reward_awarded and curr_grade.syntax_score >= 0.99:
 
454
  if prev_grade_score < 0.99:
455
  components["syntax"] = SYNTAX_FIX_BONUS
456
  self._syntax_reward_awarded = True
457
 
458
- # 3. Test reward: improvement in test pass rate
459
  if curr_grade.tests_total > 0:
 
460
  curr_test_frac = curr_grade.tests_passed / curr_grade.tests_total
 
461
  test_delta = curr_test_frac - self._best_visible_test_fraction
 
462
  if test_delta > 0:
 
463
  components["test"] = min(TEST_PASS_REWARD_SCALE * test_delta, 0.20)
464
- self._best_visible_test_fraction = max(self._best_visible_test_fraction, curr_test_frac)
465
-
466
- # 4. Quality reward: code quality improvement
467
  quality_delta = curr_grade.quality_score - self._best_quality_score
468
  if quality_delta > 0:
 
469
  components["quality"] = min(QUALITY_BONUS_SCALE * quality_delta, 0.15)
470
- self._best_quality_score = max(self._best_quality_score, curr_grade.quality_score)
471
 
472
- # 5. Stagnation penalty: code not changed (except analyze_code)
473
  if not code_changed and not curr_grade.details.get("compile_error"):
474
  components["stagnation"] = -STAGNATION_PENALTY
475
 
476
- # 6. Regression penalty: score decreased
477
  if score_delta < 0:
 
478
  components["regression"] = REGRESSION_PENALTY_SCALE * abs(score_delta)
479
 
480
- # 7. Timeout penalty
481
  if curr_grade.timed_out:
482
  components["regression"] = -TIMEOUT_PENALTY
483
 
484
- # Compute total and clamp to [-1.0, 1.0]
485
  total = (
486
  components["progress"]
487
  + components["syntax"]
@@ -490,6 +662,8 @@ class PythonCodeReviewEnvironment(
490
  - components["stagnation"]
491
  - components["regression"]
492
  )
 
 
493
  components["total"] = max(-1.0, min(1.0, round(total, 6)))
494
 
495
  return components
@@ -524,7 +698,39 @@ class PythonCodeReviewEnvironment(
524
  self._state.history.append(entry)
525
 
526
  def _log_debug_step(self, reward: RewardDetails) -> None:
527
- """Log step details for debugging."""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
528
  print(
529
  f"\nStep {self._state.step_count:2d} | "
530
  f"Score: {reward.curr_score:.3f} | "
@@ -533,7 +739,7 @@ class PythonCodeReviewEnvironment(
533
  f"Changed: {reward.code_changed}"
534
  )
535
 
536
- # Log components if non-zero
537
  components = [
538
  ("Progress", reward.progress_delta),
539
  ("Syntax", reward.syntax_reward),
@@ -543,9 +749,12 @@ class PythonCodeReviewEnvironment(
543
  ("Regression", -reward.regression_penalty),
544
  ]
545
 
 
546
  non_zero = [f"{name}={val:+.3f}" for name, val in components if abs(val) > 0.001]
547
  if non_zero:
548
  print(f" | {' | '.join(non_zero)}")
 
 
549
  print(f" | Reason: {reward.reason}")
550
 
551
 
 
1
+ """Core OpenEnv environment for Python code review and repair tasks.
2
+
3
+ REWARD SYSTEM ARCHITECTURE
4
+ ==========================
5
+
6
+ The environment implements a dynamic, multi-component reward system to provide
7
+ meaningful feedback at every step of agent learning.
8
+
9
+ Six independent reward components are computed and combined:
10
+
11
+ 1. PROGRESS REWARD (max +0.25)
12
+ - Awarded for score improvement: min(PROGRESS_SCALE * score_delta, 0.25)
13
+ - Encourages continuous improvement on the task
14
+
15
+ 2. SYNTAX REWARD (max +0.35)
16
+ - One-time bonus when code first becomes compilable
17
+ - Acknowledges the critical step of creating valid code
18
+
19
+ 3. TEST REWARD (max +0.20)
20
+ - Based on test pass rate improvement
21
+ - Formula: min(TEST_PASS_REWARD_SCALE * test_improvement, 0.20)
22
+
23
+ 4. QUALITY REWARD (max +0.15)
24
+ - Based on AST-detected code quality improvements
25
+ - Rewards better structure, readability, best practices
26
+
27
+ 5. STAGNATION PENALTY (−0.10)
28
+ - Applied when agent acts but code doesn't change
29
+ - Encourages editing rather than repeated analysis
30
+
31
+ 6. REGRESSION PENALTY (scale −0.20)
32
+ - Applied when score declines: REGRESSION_PENALTY_SCALE * abs(score_delta)
33
+ - Discourages actions that make code worse
34
+
35
+ FINAL REWARD
36
+ Final reward = clamp(progress + syntax + test + quality - stagnation - regression, -1.0, +1.0)
37
+
38
+ Always bounded in [-1.0, +1.0] for interpretability and learning stability.
39
+
40
+ See RewardDetails in models.py for all fields returned with each reward.
41
+ """
42
 
43
  from __future__ import annotations
44
 
 
61
  from tasks import TaskSpec, get_task, list_task_descriptors, list_task_summaries, task_ids
62
 
63
 
64
+ # ============================================================================
65
+ # REWARD SHAPING CONSTANTS
66
+ # ============================================================================
67
+ # These constants control the reward magnitude for each component.
68
+ # Tuning these values changes agent learning incentives.
69
+
70
+ # Component 1: Score improvement reward
71
+ PROGRESS_SCALE = 0.25
72
+ """Scale for progress rewards. Higher = more reward for score improvement."""
73
+
74
+ # Component 2: Syntax/compilation fix reward
75
+ SYNTAX_FIX_BONUS = 0.35
76
+ """One-time bonus for first time code compiles."""
77
+
78
+ # Component 3: Test improvement reward
79
+ TEST_PASS_REWARD_SCALE = 0.30
80
+ """Scale for test pass rate rewards."""
81
+
82
+ # Component 4: Code quality reward
83
+ QUALITY_BONUS_SCALE = 0.15
84
+ """Scale for code quality improvements (AST-based)."""
85
+
86
+ # Component 5: Stagnation penalty
87
+ STAGNATION_PENALTY = 0.10
88
+ """Penalty when action is taken but code unchanged."""
89
+
90
+ # Component 6: Regression penalty
91
+ REGRESSION_PENALTY_SCALE = 0.20
92
+ """Scale for penalties when score declines."""
93
+
94
+ # One-time completion bonus
95
+ COMPLETION_BONUS = 0.50
96
+ """Bonus for fully correct solution."""
97
+
98
+ # Invalid/error penalties
99
  INVALID_ACTION_PENALTY = 0.15
100
+ """Penalty for unsupported action types."""
101
+
102
  TIMEOUT_PENALTY = 0.15
103
+ """Penalty for execution timeout."""
104
 
105
 
106
  class PythonCodeReviewEnvironment(
107
  Environment[PythonCodeReviewAction, PythonCodeReviewObservation, PythonCodeReviewState]
108
  ):
109
+ """Production-style environment for reviewing and fixing Python code.
110
+
111
+ Implements OpenEnv compatibility and dynamic multi-component reward system.
112
+ """
113
 
114
  SUPPORTS_CONCURRENT_SESSIONS = True
115
 
 
506
  code_changed: bool,
507
  prev_grade_score: float = 0.0,
508
  ) -> dict:
509
+ """Compute all six reward components and return combined result.
510
+
511
+ This method is the core of the reward system. It evaluates agent progress
512
+ across multiple dimensions and provides transparent, component-wise feedback.
513
+
514
+ REWARD COMPONENTS (6 total):
515
+ ============================
516
+
517
+ 1. PROGRESS REWARD (positive, max +0.25)
518
+ - Awarded when score improves from previous step
519
+ - Formula: min(PROGRESS_SCALE * score_delta, 0.25)
520
+ - Why: Encourages monotonic improvement
521
+
522
+ 2. SYNTAX REWARD (positive, max +0.35)
523
+ - One-time bonus when code first compiles
524
+ - Transition: uncompilable → compilable
525
+ - Why: Acknowledges critical first step of valid code
526
+
527
+ 3. TEST REWARD (positive, max +0.20)
528
+ - Based on improvement in test pass rate
529
+ - Formula: min(TEST_PASS_REWARD_SCALE * test_improvement, 0.20)
530
+ - Tracks best test rate seen in episode (monotonic)
531
+ - Why: Rewards incremental progress on passing tests
532
+
533
+ 4. QUALITY REWARD (positive, max +0.15)
534
+ - Based on AST-detected code quality metrics
535
+ - Computed by deterministic grader (syntax_score, quality_score)
536
+ - Tracks best quality seen in episode (monotonic)
537
+ - Why: Teaches code structure and maintainability
538
+
539
+ 5. STAGNATION PENALTY (negative, −0.10)
540
+ - Applied when action is taken but code doesn't change
541
+ - Exception: No penalty if code has compile errors (still debugging)
542
+ - Why: Encourages editing over repeated analysis
543
+
544
+ 6. REGRESSION PENALTY (negative, scale −0.20)
545
+ - Applied when score decreases from previous step
546
+ - Formula: REGRESSION_PENALTY_SCALE * abs(score_delta)
547
+ - Special case: Timeout returns fixed TIMEOUT_PENALTY (−0.15)
548
+ - Why: Discourages actions that make code worse
549
+
550
+ FINAL REWARD:
551
+ =============
552
+ total = progress + syntax + test + quality - stagnation - regression
553
+ final_reward = clamp(total, -1.0, +1.0)
554
+
555
+ The result is always bounded for interpretability and stability.
556
+
557
+ Args:
558
+ curr_score: Current score after action (0.0 to 1.0)
559
+ prev_score: Score from previous step (0.0 to 1.0)
560
+ curr_grade: TaskGrade object with detailed metrics
561
+ code_changed: Boolean, whether the action modified code
562
+ prev_grade_score: Previous syntax_score for detecting first compile
563
+
564
+ Returns:
565
+ dict with keys: "progress", "syntax", "test", "quality",
566
+ "stagnation", "regression", "total"
567
+ All values are floats, with total clamped to [-1.0, +1.0]
568
+ """
569
+ # Initialize all components to zero
570
  components = {
571
  "progress": 0.0,
572
  "syntax": 0.0,
 
577
  "total": 0.0,
578
  }
579
 
580
+ # ====================================================================
581
+ # COMPONENT 1: PROGRESS REWARD
582
+ # ====================================================================
583
+ # Reward score improvement. Encourages continuous progress towards goal.
584
  score_delta = curr_score - prev_score
585
  if score_delta > 0:
586
+ # Scale improvement by constant, cap at 0.25 to prevent dominance
587
  components["progress"] = min(PROGRESS_SCALE * score_delta, 0.25)
588
 
589
+ # ====================================================================
590
+ # COMPONENT 2: SYNTAX REWARD
591
+ # ====================================================================
592
+ # One-time bonus for fixing syntax errors and making code compilable.
593
+ # This is tracked per episode with _syntax_reward_awarded flag.
594
  if not self._syntax_reward_awarded and curr_grade.syntax_score >= 0.99:
595
+ # Only award if transitioning from non-compilable to compilable
596
  if prev_grade_score < 0.99:
597
  components["syntax"] = SYNTAX_FIX_BONUS
598
  self._syntax_reward_awarded = True
599
 
600
+ # ====================================================================
601
+ # COMPONENT 3: TEST REWARD
602
+ # ====================================================================
603
+ # Reward improvement in test pass rate. Track best rate seen this episode.
604
  if curr_grade.tests_total > 0:
605
+ # Fraction of visible tests currently passing
606
  curr_test_frac = curr_grade.tests_passed / curr_grade.tests_total
607
+ # Improvement since best rate seen in episode
608
  test_delta = curr_test_frac - self._best_visible_test_fraction
609
+
610
  if test_delta > 0:
611
+ # Scale improvement, cap at 0.20 to prevent dominance
612
  components["test"] = min(TEST_PASS_REWARD_SCALE * test_delta, 0.20)
613
+ # Update best rate seen in this episode (monotonic)
614
+ self._best_visible_test_fraction = max(
615
+ self._best_visible_test_fraction, curr_test_frac
616
+ )
617
+
618
+ # ====================================================================
619
+ # COMPONENT 4: QUALITY REWARD
620
+ # ====================================================================
621
+ # Reward improvements in code quality (AST-based metrics from grader).
622
+ # Track best quality metric seen in this episode.
623
  quality_delta = curr_grade.quality_score - self._best_quality_score
624
  if quality_delta > 0:
625
+ # Scale improvement, cap at 0.15 to prevent dominance
626
  components["quality"] = min(QUALITY_BONUS_SCALE * quality_delta, 0.15)
627
+ # Update best quality seen in this episode (monotonic)
628
+ self._best_quality_score = max(
629
+ self._best_quality_score, curr_grade.quality_score
630
+ )
631
 
632
+ # ====================================================================
633
+ # COMPONENT 5: STAGNATION PENALTY
634
+ # ====================================================================
635
+ # Penalize when agent acts but doesn't change code (except during debugging).
636
+ # Exception: No penalty if code still has compile errors (debugging mode).
637
  if not code_changed and not curr_grade.details.get("compile_error"):
638
  components["stagnation"] = -STAGNATION_PENALTY
639
 
640
+ # ====================================================================
641
+ # COMPONENT 6: REGRESSION PENALTY
642
+ # ====================================================================
643
+ # Penalize when score decreases (regression).
644
+ # Special case: Timeout incurs fixed penalty instead of score-based.
645
  if score_delta < 0:
646
+ # Scale penalty by magnitude of regression
647
  components["regression"] = REGRESSION_PENALTY_SCALE * abs(score_delta)
648
 
649
+ # Timeout gets special fixed penalty
650
  if curr_grade.timed_out:
651
  components["regression"] = -TIMEOUT_PENALTY
652
 
653
+ # ====================================================================
654
+ # FINAL REWARD COMPUTATION
655
+ # ====================================================================
656
+ # Combine all components: sum positives, subtract negatives, clamp to [-1, 1]
657
  total = (
658
  components["progress"]
659
  + components["syntax"]
 
662
  - components["stagnation"]
663
  - components["regression"]
664
  )
665
+
666
+ # Clamp to [-1.0, +1.0] for bounded, interpretable rewards
667
  components["total"] = max(-1.0, min(1.0, round(total, 6)))
668
 
669
  return components
 
698
  self._state.history.append(entry)
699
 
700
  def _log_debug_step(self, reward: RewardDetails) -> None:
701
+ """Log step details for debugging and agent understanding.
702
+
703
+ When verbose=True during initialization, this method prints detailed
704
+ information about each step, including:
705
+
706
+ - Step number in episode
707
+ - Score before and after (and delta)
708
+ - Final reward value (bounded in [-1.0, +1.0])
709
+ - Whether code was modified
710
+ - Component breakdown (only non-zero components shown)
711
+ - Human-readable reason/explanation
712
+
713
+ This output is designed to help:
714
+ - Monitor agent learning trajectory
715
+ - Debug why rewards are what they are
716
+ - Verify reward system is functioning correctly
717
+ - Understand what agent actions are incentivized
718
+
719
+ Example output:
720
+ -----
721
+ Step 1 | Score: 0.698 | Delta: +0.698 | Reward: +0.4239 | Changed: False
722
+ | Progress=+0.174 | Quality=+0.149 | Stagnation=+0.100
723
+ | Reason: Syntax error detected: '(' was never closed
724
+
725
+ Step 2 | Score: 1.000 | Delta: +0.302 | Reward: +0.6006 | Changed: True
726
+ | Progress=+0.250 | Syntax=+0.350
727
+ | Reason: Code updated.
728
+ -----
729
+
730
+ Args:
731
+ reward: RewardDetails object containing all reward information
732
+ """
733
+ # Print main step summary line
734
  print(
735
  f"\nStep {self._state.step_count:2d} | "
736
  f"Score: {reward.curr_score:.3f} | "
 
739
  f"Changed: {reward.code_changed}"
740
  )
741
 
742
+ # Build list of all reward components (only show non-zero)
743
  components = [
744
  ("Progress", reward.progress_delta),
745
  ("Syntax", reward.syntax_reward),
 
749
  ("Regression", -reward.regression_penalty),
750
  ]
751
 
752
+ # Filter to only non-zero components for clarity
753
  non_zero = [f"{name}={val:+.3f}" for name, val in components if abs(val) > 0.001]
754
  if non_zero:
755
  print(f" | {' | '.join(non_zero)}")
756
+
757
+ # Print human-readable explanation
758
  print(f" | Reason: {reward.reason}")
759
 
760