Mihir1107 committed on
Commit
2a3e99f
·
1 Parent(s): c90be96

Fix README: update reward formula, tasks table, remove duplicate line

  1. README.md +11 -11
README.md CHANGED
@@ -49,28 +49,29 @@ Strategy weights are normalised internally and do not need to sum to 1.

  | Name | flip_y | Budget | max_steps | Success criteria | Expected random score |
  |---|---|---|---|---|---|
- | `easy` | 0.05 | 300 | 15 | performance > 0.55 | ~0.60 |
- | `medium` | 0.25 | 150 | 12 | performance > 0.52 AND avg noise ratio < 0.30 | ~0.40 |
- | `hard` | 0.30 | 100 | 8 | performance > 0.53 (+ budget efficiency) | ~0.30 |

  ---

  ## Reward Function

  ```
- gain = (new_performance - old_performance) * 5.0
- + 0.2 * std(selected_batch) # diversity bonus
- + 0.2 * (new_performance - old_performance) # alignment bonus

- if redundancy > 0.8: gain *= 0.5 # redundancy penalty
  if new_performance > 0.85: gain *= 0.7 # diminishing-returns cap

- noise_penalty = 0.4 * noise_ratio_of_selected_batch

  reward = gain
- - 0.01 * batch_size # budget cost
- - 0.3 * redundancy # cosine similarity to training centroid
  - noise_penalty
  ```

  ---
@@ -131,4 +132,3 @@ Scores below are from the fixed balanced agent (`uncertainty=0.4, diversity=0.4,
  | medium | 0.6600 | ✅ | 0.6569 |
  | hard | 0.4174 | ✅ | 0.6176 |

- Scores are from the fixed balanced agent (`uncertainty=0.4, diversity=0.4, random=0.2`, seed=42) via `GET /baseline`.


  | Name | flip_y | Budget | max_steps | Success criteria | Expected random score |
  |---|---|---|---|---|---|
+ | `easy` | 0.05 | 300 | 15 | performance > 0.62 | ~0.60 |
+ | `medium` | 0.25 | 150 | 12 | performance > 0.52 AND avg noise ratio < 0.50 | ~0.40 |
+ | `hard` | 0.30 | 100 | 8 | performance > 0.58 (scored jointly with budget efficiency) | ~0.30 |

  ---
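
As a rough illustration of how the updated task rows could be checked in code, here is a minimal sketch. `TaskSpec`, `TASKS`, and `meets_threshold` are hypothetical names that do not come from the repository. The sketch encodes only the thresholds from the table and does not model how budget efficiency is scored for `hard`, since the README does not spell that out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container mirroring one row of the tasks table."""
    name: str
    flip_y: float
    budget: int
    max_steps: int
    min_performance: float                       # strict lower bound on performance
    max_avg_noise_ratio: Optional[float] = None  # only `medium` sets this


# Values copied from the updated table; the structure itself is an assumption.
TASKS = {
    "easy": TaskSpec("easy", 0.05, 300, 15, 0.62),
    "medium": TaskSpec("medium", 0.25, 150, 12, 0.52, 0.50),
    "hard": TaskSpec("hard", 0.30, 100, 8, 0.58),
}


def meets_threshold(task: TaskSpec, performance: float,
                    avg_noise_ratio: float = 0.0) -> bool:
    """Return True when the table's strict inequalities hold.

    Budget efficiency for `hard` is intentionally not modelled.
    """
    if performance <= task.min_performance:
        return False
    if (task.max_avg_noise_ratio is not None
            and avg_noise_ratio >= task.max_avg_noise_ratio):
        return False
    return True
```

Both comparisons are strict, so under this reading a run that lands exactly on a threshold does not count as a success.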

  ## Reward Function

  ```
+ gain = (new_performance - old_performance) * 5.0
+ + mean(||selected_batch - train_centroid||) * 0.05 # diversity bonus

+ if redundancy > 0.8: gain *= 0.5 # redundancy penalty
  if new_performance > 0.85: gain *= 0.7 # diminishing-returns cap

+ noise_scale = 1.0 + flip_y * 2.0 # 1.1 easy | 1.5 medium | 1.6 hard
+ noise_penalty = noise_scale * noise_ratio_of_selected_batch

  reward = gain
+ - 0.01 * batch_size # budget cost
+ - 0.3 * redundancy # cosine similarity to training centroid
  - noise_penalty
+ + 0.15 # baseline offset (keeps signal in mixed-sign territory)
  ```

  ---
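
The updated reward pseudocode can be read as the following plain-Python sketch. It is one possible interpretation, not the repository's implementation. The function names, the argument list, the use of Euclidean distance for `||...||`, and treating `batch_size` as the number of selected rows are all assumptions. `redundancy` and `noise_ratio` are taken as precomputed inputs.

```python
import math
from typing import Sequence


def noise_scale(flip_y: float) -> float:
    """Scale the noise penalty with the task's label-flip rate."""
    return 1.0 + flip_y * 2.0


def diversity_bonus(batch: Sequence[Sequence[float]],
                    train_centroid: Sequence[float]) -> float:
    """0.05 times the mean Euclidean distance from each selected row to the centroid."""
    distances = [math.dist(row, train_centroid) for row in batch]
    return 0.05 * sum(distances) / len(distances)


def compute_reward(old_performance: float, new_performance: float,
                   batch: Sequence[Sequence[float]],
                   train_centroid: Sequence[float],
                   redundancy: float, noise_ratio: float,
                   flip_y: float) -> float:
    """Follow the README pseudocode step by step (assumed reading)."""
    gain = (new_performance - old_performance) * 5.0
    gain += diversity_bonus(batch, train_centroid)

    if redundancy > 0.8:         # redundancy penalty
        gain *= 0.5
    if new_performance > 0.85:   # diminishing-returns cap
        gain *= 0.7

    noise_penalty = noise_scale(flip_y) * noise_ratio

    return (gain
            - 0.01 * len(batch)  # budget cost, assuming batch_size == len(batch)
            - 0.3 * redundancy   # cosine similarity to training centroid
            - noise_penalty
            + 0.15)              # baseline offset
```

For example, with `old_performance=0.50`, `new_performance=0.60`, a two-row batch `[(3.0, 4.0), (0.0, 0.0)]`, centroid `(0.0, 0.0)`, `redundancy=0.2`, `noise_ratio=0.1`, and `flip_y=0.25`, the gain is 0.5 + 0.125 = 0.625, the noise penalty is 1.5 * 0.1 = 0.15, and the reward comes out at about 0.545.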
 
  | medium | 0.6600 | ✅ | 0.6569 |
  | hard | 0.4174 | ✅ | 0.6176 |