docs: tighten reward section into table
docs/writeup.md +9 -20
docs/writeup.md
CHANGED
|
@@ -38,28 +38,17 @@ Each episode: the environment samples random parameters and initial conditions,
|
|
| 38 |
|
| 39 |
## Reward Design
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
|
| 44 |
-
The primary signal. Compares the agent's simulated trajectory against the observed one using R². R² = 1 means perfect fit; R² = 0 means no better than predicting the mean. This is a continuous, differentiable signal that gives GRPO a smooth gradient to follow.
|
| 45 |
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
Computed as `1 − (operator_count / 12)`. Shorter equations that explain the data are physically preferable to long, overfit ones. **Gated on R² ≥ 0.10** — if the equation is wrong, simplicity doesn't score. Without this gate the model would learn to output `d2y/dt2 = 0` (zero operators, simplicity = 1) and farm 20% reward for a completely wrong trajectory.
|
| 54 |
-
|
| 55 |
-
### `reward_format` — syntactic and numerical validity
|
| 56 |
-
1.0 only if the output (a) parses against the whitelisted ODE grammar, **and** (b) `odeint` integrates it to completion without NaN or overflow. Without the NaN check, a valid-looking but explosive equation like `d2y/dt2 = exp(vy**10)` would earn format reward despite being physically nonsense.
|
| 57 |
-
|
| 58 |
-
---
|
| 59 |
-
|
| 60 |
-
### Why these five together
|
| 61 |
-
|
| 62 |
-
The three correctness-shaped signals (`match`, `match_dense`, `correctness`) dominate the GRPO advantage, ensuring physical accuracy drives the gradient. `simplicity` prevents equation bloat. `format` ensures the output is always usable by the verifier. Together they make reward hacking significantly harder than any single signal would.
|
| 63 |
|
| 64 |
---
|
| 65 |
|
| 38 |
|
| 39 |
## Reward Design
|
| 40 |
|
| 41 |
+
All reward is computed from `scipy.integrate.odeint` output, with no model-in-the-loop scoring. Five components are summed per rollout:
|
| 42 |
|
| 43 |
+
**R²** (coefficient of determination) measures trajectory fit: R² = 1 is a perfect match, R² = 0 is no better than predicting the mean, R² < 0 is actively wrong.
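A minimal sketch of the R² computation described above, assuming the observed and simulated trajectories are sampled on the same time grid. The function name `r_squared` and the sine-wave test data are hypothetical, chosen for illustration; the writeup does not show the actual implementation.

```python
import numpy as np

def r_squared(observed, simulated):
    """Coefficient of determination between two trajectories.

    Hypothetical helper: assumes both arrays share one time grid and that
    the observed trajectory is not constant (otherwise ss_tot is zero).
    """
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    ss_res = np.sum((observed - simulated) ** 2)        # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot

# Illustrative data only: one period of a sine wave.
t = np.linspace(0.0, 1.0, 50)
observed = np.sin(2 * np.pi * t)

perfect = r_squared(observed, observed)                                    # exact match
mean_only = r_squared(observed, np.full_like(observed, observed.mean()))   # predicts the mean
worse = r_squared(observed, -observed)                                     # sign-flipped trajectory
```

The three calls correspond to the three regimes named above: a perfect match, a prediction no better than the mean, and a trajectory that is actively wrong.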
|
| 44 |
|
| 45 |
+
| Component | Formula | What it rewards | Why it's needed |
|
| 46 |
+
|-----------|---------|-----------------|-----------------|
|
| 47 |
+
| `match` | R² | Continuous fit quality | Primary learning signal |
|
| 48 |
+
| `match_dense` | √R² | Same, stretched | R² ≈ 0 early on; √R² amplifies small values (√0.05 ≈ 0.22), so GRPO still sees a usable signal in early steps |
|
| 49 |
+
| `correctness` | 1 if R² ≥ 0.70 else 0 | Binary "good enough" | Creates a cliff the policy actively climbs; helps escape plateaus where the dense signal flattens |
|
| 50 |
+
| `simplicity` | 1 − operators/12, gated on R² ≥ 0.10 | Shorter equations | Without the gate, `d2y/dt2 = 0` scores simplicity = 1 for free despite being completely wrong |
|
| 51 |
+
| `format` | 1 if parses **and** `odeint` succeeds without NaN | Valid, simulatable output | Without the NaN check, explosive equations like `exp(vy**10)` parse successfully and claim format reward |
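The table can be read as the following sketch, which takes an already-computed R², an operator count, and a format flag as inputs. Several details are assumptions not stated in the writeup: the function name `reward_components`, clipping negative R² to 0 before taking the square root, flooring `simplicity` at 0 for equations with more than 12 operators, and using an unweighted sum as the total.

```python
import math

MAX_OPERATORS = 12         # divisor in the simplicity formula
CORRECTNESS_R2 = 0.70      # threshold for the binary correctness reward
SIMPLICITY_GATE_R2 = 0.10  # simplicity only counts above this fit quality

def reward_components(r2, operator_count, format_ok):
    """Return the five reward components as a dict (hypothetical helper)."""
    # Assumption: negative R² contributes nothing to match / match_dense.
    r2_clipped = min(max(r2, 0.0), 1.0)
    if r2 >= SIMPLICITY_GATE_R2:
        # Assumption: floor at 0 so very long equations are not penalized below zero.
        simplicity = max(0.0, 1.0 - operator_count / MAX_OPERATORS)
    else:
        simplicity = 0.0  # gate closed: a wrong equation earns no simplicity reward
    return {
        "match": r2_clipped,
        "match_dense": math.sqrt(r2_clipped),
        "correctness": 1.0 if r2 >= CORRECTNESS_R2 else 0.0,
        "simplicity": simplicity,
        "format": 1.0 if format_ok else 0.0,
    }

# A reasonable fit with a short equation.
good = reward_components(r2=0.81, operator_count=3, format_ok=True)

# The degenerate `d2y/dt2 = 0` case: zero operators, valid format, poor fit.
degenerate = reward_components(r2=-0.5, operator_count=0, format_ok=True)

# Assumption: components are summed without weights.
good_total = sum(good.values())
degenerate_total = sum(degenerate.values())
```

In the degenerate case the gate zeroes `simplicity`, so the only reward left is `format`, which is the behavior the "Why it's needed" column describes.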
|
| 52 |
|
| 53 |
---
|
| 54 |
|