# PhysiX — Equation Discovery from Noisy Trajectories via RLVR

**OpenEnv India Hackathon 2026** · [Live Space](https://huggingface.co/spaces/Pratyush-01/physix-live) · [Trained Model](https://huggingface.co/Pratyush-01/physix-3b-rl) · [W&B Runs](https://wandb.ai/pratyush01/physix-live)

---

## The Problem

Given a short, noisy trajectory of a physical system — positions and velocities over time — can a language model discover the underlying equation of motion?

This is symbolic regression meets agentic RL. The challenge: the equation space is vast, noise means no trajectory perfectly fits any ODE, and the agent must learn to iterate — propose, simulate, compare residuals, refine. Classical tools (genetic programming, sparse regression) can do this, but they can't read a natural language hint or reason about failure in English. We train a 3B LLM to do it using RLVR.

---

## The Environment

**PhysiXEnvironment** gives the agent a noisy trajectory from a physical system and asks it to output an ODE that reproduces the motion. All reward comes from `scipy.odeint` — no LLM-as-judge.

### Systems

Three systems were used for training:

| System | Ground-truth ODE |
|--------|-----------------|
| Free Fall | `d2y/dt2 = -g` |
| Simple Pendulum | `d2theta/dt2 = -(g/L)*sin(theta)` |
| Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` |

Parameters and initial conditions are randomised per episode.
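
For concreteness, here is a minimal sketch of how one of these systems might be simulated with `scipy.odeint`. The function names, parameter values, and noise level below are illustrative, not the environment's actual code.

```python
import numpy as np
from scipy.integrate import odeint

def damped_spring_rhs(state, t, k, m, b):
    """Right-hand side of d2x/dt2 = -(k/m)*x - (b/m)*dx."""
    x, v = state
    return [v, -(k / m) * x - (b / m) * v]

def simulate_damped_spring(k=4.0, m=1.0, b=0.3, x0=1.0, v0=0.0,
                           t_max=10.0, n_points=100, noise_std=0.02, seed=None):
    """Integrate the ODE and add Gaussian observation noise, as the environment does per episode."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, t_max, n_points)
    traj = odeint(damped_spring_rhs, [x0, v0], t, args=(k, m, b))
    return t, traj + rng.normal(0.0, noise_std, traj.shape)
```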

### Episode flow

1. `reset()` → agent receives a noisy trajectory + a one-sentence physical hint
2. Agent proposes an ODE in JSON: `{"equation": "...", "params": {...}}`
3. Environment simulates it via `scipy.odeint` and computes R² (a parse-and-simulate sketch follows this list)
4. Agent receives a residual summary in English + numeric reward breakdown
5. Repeat up to 8 turns
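
A minimal sketch of steps 2 and 3, assuming the agent's equation string is turned into a callable right-hand side for `odeint`. The parsing via `eval` is shown only for brevity; the real environment's equation parser and variable names are not documented here and may differ.

```python
import json
import numpy as np
from scipy.integrate import odeint

# Hypothetical agent proposal in the JSON format from step 2.
proposal = json.loads(
    '{"equation": "d2x/dt2 = -(k/m)*x - (b/m)*dx", "params": {"k": 4.0, "m": 1.0, "b": 0.3}}'
)
rhs_expr = proposal["equation"].split("=", 1)[1].strip()   # "-(k/m)*x - (b/m)*dx"
params = proposal["params"]

def proposed_rhs(state, t):
    x, dx = state
    names = {"x": x, "dx": dx, "t": t, "sin": np.sin, "cos": np.cos, "exp": np.exp, **params}
    accel = eval(rhs_expr, {"__builtins__": {}}, names)    # second derivative from the proposal
    return [dx, accel]

t = np.linspace(0.0, 10.0, 100)
candidate = odeint(proposed_rhs, [1.0, 0.0], t)            # compared against the noisy trajectory for R²
```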

---

## Reward Design

All reward is computed from `scipy.odeint` — no model-in-the-loop scoring.

**R²** (coefficient of determination), computed as 1 − SS_res/SS_tot: R² = 1 is a perfect match, R² = 0 means no better than predicting the mean, and R² < 0 means the proposed ODE fits worse than that mean baseline.

| Component | Formula | What it rewards | Why it's needed |
|-----------|---------|-----------------|-----------------|
| `match` | R² | Continuous fit quality | Primary learning signal |
| `match_dense` | √R² | Same, stretched | R² ≈ 0 early on; √R² gives non-zero gradient (√0.05 ≈ 0.22) so GRPO isn't blind in early steps |
| `correctness` | 1 if R² ≥ 0.70 else 0 | Binary "good enough" | Creates a cliff the policy climbs; helps escape plateaus |
| `simplicity` | 1 − operators/12, gated on R² ≥ 0.10 | Shorter equations | Without the gate, `d2y/dt2 = 0` scores simplicity = 1 for free |
| `format` | 1 if parses **and** `odeint` runs without NaN | Valid, simulatable output | Without the NaN check, explosive equations like `exp(vy**10)` claim reward |
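
Putting the table together, a minimal sketch of the composite reward. The thresholds come from the table above; the unweighted sum and the helper names are assumptions, not the project's exact scoring code.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def compute_reward(y_true, y_pred, n_operators, parsed_ok):
    """Composite reward mirroring the table above (simple sum of components assumed)."""
    fmt = float(parsed_ok and np.all(np.isfinite(y_pred)))          # parses AND odeint ran without NaN/inf
    if not fmt:
        return {"format": 0.0, "match": 0.0, "match_dense": 0.0,
                "correctness": 0.0, "simplicity": 0.0, "total": 0.0}
    r2 = r_squared(y_true, y_pred)
    match = r2                                                       # continuous fit quality
    match_dense = float(np.sqrt(max(r2, 0.0)))                       # stretched signal near R^2 ~ 0
    correctness = float(r2 >= 0.70)                                  # binary "good enough" cliff
    simplicity = (1.0 - n_operators / 12.0) if r2 >= 0.10 else 0.0   # gate blocks trivial equations
    total = match + match_dense + correctness + simplicity + fmt
    return {"format": fmt, "match": match, "match_dense": match_dense,
            "correctness": correctness, "simplicity": simplicity, "total": total}
```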

---

## Training: SFT → GRPO

### Why SFT first

Qwen2.5-3B out of the box produces LaTeX, prose, or malformed JSON on ~80% of turns — the verifier can't parse any of it. GRPO needs *variance in reward* across rollouts to estimate advantages; if every rollout scores ~0 because nothing parses, the gradient is zero and nothing learns.

SFT on synthetic `(prompt, ground_truth_equation)` pairs teaches the model the output format before RL begins: 4 epochs, ~5 min on an L40S. After SFT, >90% of completions parse and simulate successfully, so GRPO has a signal to work with.
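
A sketch of what one synthetic SFT pair might look like; the prompt wording, field names, and trajectory values below are illustrative assumptions, not the actual dataset format.

```python
import json

# One hypothetical (prompt, completion) training pair teaching the JSON output format.
prompt = (
    "Hint: a mass on a spring with friction, gradually losing energy.\n"
    "Noisy trajectory (t, x, dx): [[0.0, 1.00, 0.00], [0.1, 0.98, -0.39], [0.2, 0.92, -0.76]]\n"
    "Propose the governing ODE as JSON with keys 'equation' and 'params'."
)
completion = json.dumps({
    "equation": "d2x/dt2 = -(k/m)*x - (b/m)*dx",
    "params": {"k": 4.0, "m": 1.0, "b": 0.3},
})
sft_example = {"prompt": prompt, "completion": completion}
```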

### GRPO

- **Model:** Qwen/Qwen2.5-3B-Instruct + LoRA (rank 32)
- **Systems:** free_fall, simple_pendulum, damped_spring
- **Steps:** 200 (early stopping on reward convergence)
- **LR:** 1e-5
- **Generations:** 4 per prompt
- **Framework:** Unsloth + TRL GRPOTrainer (a configuration sketch follows this list)
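
A minimal configuration sketch matching the hyperparameters above, using TRL's `GRPOConfig`/`GRPOTrainer` and Unsloth's loading API as commonly documented. `train_dataset` and `physix_reward` are placeholders, and exact argument names can vary across TRL/Unsloth versions, so treat this as an outline rather than the training notebook itself.

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model and attach a rank-32 LoRA adapter.
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=32, lora_alpha=32)

config = GRPOConfig(
    output_dir="physix-3b-grpo",
    learning_rate=1e-5,             # LR from the list above
    max_steps=200,                  # with early stopping on reward convergence
    num_generations=4,              # rollouts per prompt for the group-relative advantage
    per_device_train_batch_size=4,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,    # placeholder: prompts built from PhysiXEnvironment episodes
    reward_funcs=[physix_reward],   # placeholder: verifier-backed reward from scipy.odeint
    processing_class=tokenizer,
)
trainer.train()
```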

---

## Results

| SFT Loss (↓) | GRPO Loss (↓) |
|:---:|:---:|
| ![SFT loss](plots/sft_loss.png) | ![GRPO loss](plots/loss.png) |

| GRPO Total Reward (↑) |
|:---:|
| ![reward](plots/reward.png) |

| Per-component reward breakdown |
|:---:|
| ![reward components](plots/reward_components.png) |

Key observations:
- `reward_format` jumps to ~0.9 in the first 10 steps — the SFT warm-start does its job immediately
- `reward_match_dense` (√R²) and `reward_correctness` climb from ~0.6 → ~0.95 over 200 steps
- `reward_match` (raw R²) converges to ~0.95+ by step 150
- Total mean reward rises from ~3.3 → ~4.8 (+45%) with ±1σ variance shrinking
- GRPO loss near zero is **expected** — it's the KL regularisation term; the real signal is the reward curves

Full runs: [wandb.ai/pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live)

---

## What's Novel

1. **Verifiable reward without a judge** — R² from `scipy.odeint` is ground truth, not a proxy
2. **Iterative refinement loop** — the environment feeds residual summaries back in English so the agent can reason about what went wrong and refine
3. **Reward hacking case study** — three exploits found and patched during development: parse-but-crash equations, trivial-equation simplicity farming, duplicate progress signal
4. **SFT → GRPO pipeline** — shows how a cold 3B model can be made RL-trainable in under 10 minutes of SFT

---

## Future Scope

The framework is system-agnostic — adding a new physical system means subclassing `PhysicalSystem`, implementing `simulate` and `ground_truth_equation`, and registering it. The verifier, reward function, and training loop need no changes.
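
As an illustration, a hedged sketch of what one such subclass might look like, using the Lorenz system from the list below. The `PhysicalSystem` interface shown here is inferred from the description above, not copied from the repository.

```python
import numpy as np
from scipy.integrate import odeint

class LorenzSystem(PhysicalSystem):  # hypothetical subclass; base-class details are inferred
    def ground_truth_equation(self):
        return ("dx/dt = sigma*(y - x); "
                "dy/dt = x*(rho - z) - y; "
                "dz/dt = x*y - beta*z")

    def simulate(self, t, state0, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        def rhs(state, _t):
            x, y, z = state
            return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]
        return odeint(rhs, state0, t)
```

Registering such a class would expose it as an additional training system without touching the verifier, reward function, or training loop.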

Natural extensions:

- **More complex dynamics** — coupled oscillators, Lorenz attractor, reaction-diffusion, N-body gravity
- **Larger models** — the same pipeline runs on 7B; LoRA rank and LR need tuning but the reward design transfers directly
- **Multi-turn curriculum** — currently trains on turn-0 only; training on full refinement trajectories would teach the model to use residual feedback more effectively
- **Noisier / sparser observations** — current trajectories have moderate Gaussian noise; testing on sparser or higher-noise regimes would stress-test the R² reward

---

## Links

- **Live demo:** https://huggingface.co/spaces/Pratyush-01/physix-live
- **Trained model:** https://huggingface.co/Pratyush-01/physix-3b-rl
- **Training notebook:** https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/train/physix_train_colab.ipynb
- **W&B project:** https://wandb.ai/pratyush01/physix-live