docs: expand reward section, remove held-out mention
- docs/writeup.md +20 -14
docs/writeup.md
CHANGED
|
@@ -29,8 +29,6 @@ Classical symbolic regression tools (GP, sparse regression) can do this, but the
|
|
| 29 |
| 2 | Damped Pendulum | `d2theta/dt2 = -(g/L)*sin(theta) - b*dtheta` |
|
| 30 |
| 2 | Spring-Mass | `d2x/dt2 = -(k/m)*x` |
|
| 31 |
| 2 | Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` |
|
| 32 |
-
| 3 *(held-out)* | Projectile with Drag | 2-D coupled ODE |
|
| 33 |
-
| 3 *(held-out)* | Charged Particle in B-field | 2-D Lorentz force |
|
| 34 |
|
| 35 |
Each episode: the environment samples random parameters and initial conditions, simulates a noisy trajectory, and presents it to the agent with a one-sentence hint. The agent proposes an equation and its parameters; the environment simulates the hypothesis via `scipy.integrate.odeint` and computes per-step R².
|
| 36 |
|
|
@@ -40,20 +38,28 @@ Each episode: the environment samples random parameters and initial conditions,
|
|
| 40 |
|
| 41 |
## Reward Design
|
| 42 |
|
| 43 |
-
Five components
|
| 44 |
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
| `match` | R² (linear) | Primary accuracy |
|
| 48 |
-
| `match_dense` | √R² | Dense gradient near zero |
|
| 49 |
-
| `correctness` | Binary at R² ≥ 0.70 | Cliff signal |
|
| 50 |
-
| `simplicity` | 1 − operators/12 | Prefer shorter equations |
|
| 51 |
-
| `format` | Parse + simulate succeeds | Syntactic validity |
|
| 52 |
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
|
| 58 |
---
|
| 59 |
|
|
|
|
| 29 |
| 2 | Damped Pendulum | `d2theta/dt2 = -(g/L)*sin(theta) - b*dtheta` |
|
| 30 |
| 2 | Spring-Mass | `d2x/dt2 = -(k/m)*x` |
|
| 31 |
| 2 | Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` |
|
| 32 |
|
| 33 |
Each episode: the environment samples random parameters and initial conditions, simulates a noisy trajectory, and presents it to the agent with a one-sentence hint. The agent proposes an equation and its parameters; the environment simulates the hypothesis via `scipy.integrate.odeint` and computes per-step R².
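A minimal sketch of the data-generating half of an episode, using the Damped Pendulum row from the table. The function names, parameter ranges, time grid, and noise level are all assumptions for illustration; the writeup does not specify them.

```python
import numpy as np
from scipy.integrate import odeint


def damped_pendulum(state, t, g, L, b):
    # d2theta/dt2 = -(g/L)*sin(theta) - b*dtheta, as a first-order system
    theta, dtheta = state
    return [dtheta, -(g / L) * np.sin(theta) - b * dtheta]


def sample_episode(rng, n_steps=200, t_max=10.0, noise_std=0.02):
    # Hypothetical sampling ranges; the real environment may differ.
    params = {"g": 9.81, "L": rng.uniform(0.5, 2.0), "b": rng.uniform(0.05, 0.5)}
    state0 = [rng.uniform(-1.0, 1.0), 0.0]
    t = np.linspace(0.0, t_max, n_steps)
    clean = odeint(damped_pendulum, state0, t,
                   args=(params["g"], params["L"], params["b"]))
    # Only the angle is observed, with additive Gaussian noise.
    noisy_theta = clean[:, 0] + rng.normal(0.0, noise_std, size=n_steps)
    return t, noisy_theta, params


rng = np.random.default_rng(0)
t, noisy_theta, params = sample_episode(rng)
```

The agent would see `t` and `noisy_theta` plus a hint, but not `params`.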
|
| 34 |
|
|
|
|
| 38 |
|
| 39 |
## Reward Design
|
| 40 |
|
| 41 |
+
Five components are summed per rollout. All are computed deterministically from the proposed equation and its `scipy.integrate.odeint` trajectory, with no model-in-the-loop scoring.
|
| 42 |
|
| 43 |
+
### `reward_match` — R² (coefficient of determination)
|
| 44 |
+
The primary signal. Compares the agent's simulated trajectory against the observed one using R². R² = 1 means a perfect fit; R² = 0 means no better than predicting the mean. Because the score is continuous, small improvements in fit earn proportionally more reward, which gives GRPO a graded signal to follow rather than an all-or-nothing one.
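For reference, R² can be computed as below. This is a sketch, not the environment's code: the function names are illustrative, and clipping negative R² to 0 is an assumption made so the reward stays in [0, 1].

```python
import numpy as np


def r_squared(observed, predicted):
    """Coefficient of determination between two equal-length trajectories."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    if ss_tot == 0.0:
        # Constant observation: treat an exact match as 1, anything else as 0.
        return 1.0 if ss_res == 0.0 else 0.0
    return 1.0 - ss_res / ss_tot


def reward_match(observed, predicted):
    # Assumption: R² below 0 (worse than predicting the mean) is clipped to 0.
    return float(np.clip(r_squared(observed, predicted), 0.0, 1.0))
```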
|
| 45 |
|
| 46 |
+
### `reward_match_dense` — √R²
|
| 47 |
+
R² is near-zero for most early rollouts, making the gradient vanishingly small. Taking the square root stretches the low end of the range: `√0.05 ≈ 0.22` instead of `0.05`. This gives GRPO a non-trivial gradient even when the model is far from the right equation, speeding up early learning.
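The stretching effect is easy to check numerically. In this sketch the function name mirrors the heading, and clipping R² to [0, 1] before the square root is an assumption.

```python
import math


def reward_match_dense(r2):
    # Assumption: R² is clipped to [0, 1] so the square root is always defined.
    return math.sqrt(min(max(r2, 0.0), 1.0))


# Low R² values are lifted far more than high ones.
for r2 in (0.01, 0.05, 0.25, 0.81):
    print(f"R2={r2:.2f} -> dense={reward_match_dense(r2):.2f}")
```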
|
| 48 |
+
|
| 49 |
+
### `reward_correctness` — binary cliff at R² ≥ 0.70
|
| 50 |
+
A binary 0/1 bonus that fires when the trajectory is "good enough". This creates a sharp cliff in reward space that the policy is incentivised to climb. In practice it helps push past local plateaus where `match` and `match_dense` flatten out — the model learns to specifically target the 0.70 threshold rather than settling for partial credit.
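The cliff itself is a one-line threshold test. A sketch, with the 0.70 value taken from the heading and the names chosen for illustration:

```python
CORRECTNESS_THRESHOLD = 0.70  # threshold stated in the section heading


def reward_correctness(r2, threshold=CORRECTNESS_THRESHOLD):
    # Binary bonus: 1.0 at or above the threshold, 0.0 below it.
    return 1.0 if r2 >= threshold else 0.0
```

A rollout at R² = 0.69 and one at R² = 0.70 differ by only 0.01 in `match` but by a full point here, which is what makes the threshold worth targeting.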
|
| 51 |
+
|
| 52 |
+
### `reward_simplicity` — prefer shorter equations
|
| 53 |
+
Computed as `1 − (operator_count / 12)`. Shorter equations that explain the data are physically preferable to long, overfit ones. **Gated on R² ≥ 0.10** — if the equation is wrong, simplicity doesn't score. Without this gate the model would learn to output `d2y/dt2 = 0` (zero operators, simplicity = 1) and farm 20% reward for a completely wrong trajectory.
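A sketch of the gated computation. The formula and the 0.10 gate come from the text; the function signature and the clamp at 0 for equations with more than 12 operators are assumptions.

```python
def reward_simplicity(operator_count, r2, max_operators=12, gate=0.10):
    # Gate: an equation that does not fit the data earns no simplicity reward.
    if r2 < gate:
        return 0.0
    # Assumption: clamp at 0 so very long equations are not scored negatively.
    return max(0.0, 1.0 - operator_count / max_operators)
```

With the gate in place, `d2y/dt2 = 0` (zero operators) scores 0 on any trajectory it fails to explain, instead of 1.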
|
| 54 |
+
|
| 55 |
+
### `reward_format` — syntactic and numerical validity
|
| 56 |
+
Scores 1.0 only if the output (a) parses against the whitelisted ODE grammar, **and** (b) `odeint` integrates it to completion without NaN or overflow. Without the NaN check, a valid-looking but explosive equation like `d2y/dt2 = exp(vy**10)` would earn format reward despite being physical nonsense.
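A sketch of check (b) only, assuming the equation has already been parsed into a Python callable `rhs(state, t)`. The grammar check (a) is out of scope here, and the helper names are hypothetical. It relies on `odeint(..., full_output=True)` returning an info dictionary whose `message` field reports success.

```python
import math

import numpy as np
from scipy.integrate import odeint


def integrates_cleanly(rhs, state0, t):
    """Return True if odeint finishes and every returned value is finite."""
    try:
        with np.errstate(all="ignore"):
            sol, info = odeint(rhs, state0, t, full_output=True)
    except (OverflowError, ValueError, ZeroDivisionError):
        # The right-hand side itself blew up, e.g. math.exp overflow.
        return False
    if info["message"] != "Integration successful.":
        return False
    return bool(np.all(np.isfinite(sol)))


def spring_mass(state, t, k_over_m=4.0):
    # d2x/dt2 = -(k/m)*x
    x, vx = state
    return [vx, -k_over_m * x]


def explosive(state, t):
    # The example from the text: d2y/dt2 = exp(vy**10)
    y, vy = state
    return [vy, math.exp(vy ** 10)]


t = np.linspace(0.0, 5.0, 101)
print(integrates_cleanly(spring_mass, [1.0, 0.0], t))
print(integrates_cleanly(explosive, [0.0, 2.0], t))
```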
|
| 57 |
+
|
| 58 |
+
---
|
| 59 |
+
|
| 60 |
+
### Why these five together
|
| 61 |
+
|
| 62 |
+
The three correctness-shaped signals (`match`, `match_dense`, `correctness`) dominate the GRPO advantage, ensuring physical accuracy drives the gradient. `simplicity` prevents equation bloat. `format` ensures the output is always usable by the verifier. Together they make reward hacking significantly harder than any single signal would.
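Putting the five together, a sketch of the per-rollout sum. Equal weighting follows from "summed" above; zeroing R² when the format check fails is an assumption, as are the function and key names.

```python
import math


def total_reward(r2, operator_count, format_ok):
    # Assumption: a rollout that fails the format check has no usable trajectory.
    r2 = min(max(r2, 0.0), 1.0) if format_ok else 0.0
    components = {
        "match": r2,
        "match_dense": math.sqrt(r2),
        "correctness": 1.0 if r2 >= 0.70 else 0.0,
        "simplicity": max(0.0, 1.0 - operator_count / 12) if r2 >= 0.10 else 0.0,
        "format": 1.0 if format_ok else 0.0,
    }
    return sum(components.values()), components


# A well-formed but wrong equation earns only the format component.
total, parts = total_reward(r2=0.0, operator_count=0, format_ok=True)
```

Three of the five terms depend directly on R², so a rollout cannot score above 2 out of 5 without fitting the trajectory.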
|
| 63 |
|
| 64 |
---
|
| 65 |
|