
PhysiX — Equation Discovery from Noisy Trajectories via RLVR

OpenEnv India Hackathon 2026 · Live Space · Trained Model · W&B Runs


The Problem

Given a short, noisy trajectory of a physical system — positions and velocities over time — can a language model discover the underlying equation of motion?

This is symbolic regression meets agentic RL. The challenge: the equation space is vast, noise means no trajectory perfectly fits any ODE, and the agent must learn to iterate — propose, simulate, compare residuals, refine. Classical tools (genetic programming, sparse regression) can do this, but they can't read a natural language hint or reason about failure in English. We train a 3B LLM to do it using RLVR.


The Environment

PhysiXEnvironment gives the agent a noisy trajectory from a physical system and asks it to output an ODE that reproduces the motion. All reward comes from scipy.odeint — no LLM-as-judge.

Systems

Three systems were used for training:

| System | Ground-truth ODE |
|---|---|
| Free Fall | d2y/dt2 = -g |
| Simple Pendulum | d2theta/dt2 = -(g/L)*sin(theta) |
| Damped Spring | d2x/dt2 = -(k/m)*x - (b/m)*dx/dt |

Parameters and initial conditions are randomised per episode.
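
For concreteness, here is a minimal sketch of how one episode's data can be generated; the function name, parameter ranges, and noise level are illustrative, not the repo's actual values:

```python
import numpy as np
from scipy.integrate import odeint

def make_free_fall_episode(rng, noise_std=0.05, n_points=50):
    """Sample parameters, integrate the ODE, and add observation noise."""
    g = rng.uniform(5.0, 15.0)                 # randomised per episode
    y0, vy0 = rng.uniform(5, 20), rng.uniform(-2, 2)

    def rhs(state, t):
        y, vy = state
        return [vy, -g]                        # d2y/dt2 = -g in first-order form

    t = np.linspace(0.0, 2.0, n_points)
    clean = odeint(rhs, [y0, vy0], t)
    noisy = clean + rng.normal(0.0, noise_std, clean.shape)
    return t, noisy, {"g": g}                  # ground truth kept for logging only
```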

Episode flow

  1. reset() → agent receives a noisy trajectory + a one-sentence physical hint
  2. Agent proposes an ODE in JSON: {"equation": "...", "params": {...}}
  3. Environment simulates it via scipy.odeint and computes R²
  4. Agent receives a residual summary in English + numeric reward breakdown
  5. Repeat up to 8 turns
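
A minimal sketch of this loop, assuming a gym-style reset()/step() interface; the env and policy objects and the exact return values are stand-ins, not the repo's actual signatures:

```python
import json

def run_episode(env, policy, max_turns=8):
    """Drive one episode: propose → simulate → read residuals → refine."""
    obs = env.reset()                      # noisy trajectory + one-sentence hint
    reward, info = 0.0, {}
    for _ in range(max_turns):
        completion = policy.generate(obs)  # model sees trajectory, hint, residuals
        action = json.loads(completion)    # {"equation": "...", "params": {...}}
        obs, reward, done, info = env.step(action)
        if done:                           # solved or turn budget exhausted
            break
    return reward, info
```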

Reward Design

All reward is computed from scipy.odeint — no model-in-the-loop scoring.

R² (coefficient of determination) is the fit metric: R² = 1 is a perfect match, R² = 0 means the candidate does no better than predicting the mean, and R² < 0 is actively wrong. Concretely, R² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)² over the observed trajectory.

| Component | Formula | What it rewards | Why it's needed |
|---|---|---|---|
| match | R² | Continuous fit quality | Primary learning signal |
| match_dense | √R² | Same, stretched | R² ≈ 0 early on; √R² gives a non-zero gradient (√0.05 ≈ 0.22) so GRPO isn't blind in early steps |
| correctness | 1 if R² ≥ 0.70 else 0 | Binary "good enough" | Creates a cliff the policy climbs; helps escape plateaus |
| simplicity | 1 − operators/12, gated on R² ≥ 0.10 | Shorter equations | Without the gate, d2y/dt2 = 0 scores simplicity = 1 for free |
| format | 1 if the output parses and odeint runs without NaN | Valid, simulatable output | Without the NaN check, explosive equations like exp(vy**10) claim reward |
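
A sketch of how these components combine, using the thresholds from the table; the helper signature and the clamping of match at zero are assumptions:

```python
import math
import numpy as np

def reward_components(sim, obs, n_operators, parsed_ok, sim_ok):
    # format: must parse AND simulate without NaN/inf (short-circuits if sim is None)
    fmt = 1.0 if (parsed_ok and sim_ok and np.isfinite(sim).all()) else 0.0
    if fmt == 0.0:
        return {"match": 0.0, "match_dense": 0.0, "correctness": 0.0,
                "simplicity": 0.0, "format": 0.0}

    # R² of the simulated trajectory against the noisy observation
    ss_res = np.sum((obs - sim) ** 2)
    ss_tot = np.sum((obs - obs.mean(axis=0)) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    match = max(r2, 0.0)                        # continuous fit (clamp assumed)
    match_dense = math.sqrt(max(r2, 0.0))       # non-zero gradient near R² = 0
    correctness = 1.0 if r2 >= 0.70 else 0.0    # binary "good enough" cliff
    simplicity = (1.0 - n_operators / 12) if r2 >= 0.10 else 0.0  # gated
    return {"match": match, "match_dense": match_dense,
            "correctness": correctness, "simplicity": simplicity, "format": fmt}
```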

Preventing Reward Hacking

Three exploits showed up during early runs and were patched directly in the verifier — each is now an invariant the reward function enforces.

| Exploit | What the model learned | Patch |
|---|---|---|
| Trivial-equation farming | Emit dx/dt = 0 — parses, simulates, scores simplicity = 1 for free | simplicity is gated on R² ≥ 0.10: no physical fit, no simplicity reward |
| Parse-but-crash | Emit syntactically valid but explosive equations (exp(vy**10)) that crash odeint; the agent farmed format reward for "almost runnable" output | format = 1 requires both parse success and simulation success without NaN/inf; crash → all components zero |
| Out-of-grammar tokens | Emit Python expressions outside the DSL (attribute access, lambdas, arbitrary calls) hoping the parser would accept them | The parser is an AST whitelist: only + - * / **, a fixed set of math functions, and declared state vars and parameter names; anything else → format = 0 |
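
A minimal sketch of such an AST whitelist; the exact set of allowed functions here is illustrative:

```python
import ast

ALLOWED_FUNCS = {"sin", "cos", "tan", "exp", "log", "sqrt", "abs"}
ALLOWED_NODES = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Call,
                 ast.Name, ast.Load, ast.Constant,
                 ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow, ast.USub)

def is_whitelisted(expr: str, allowed_names: set) -> bool:
    """True iff expr uses only whitelisted operators, functions, and names."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False                 # lambdas, attribute access, etc.
        if isinstance(node, ast.Call):
            if not (isinstance(node.func, ast.Name)
                    and node.func.id in ALLOWED_FUNCS):
                return False             # only whitelisted math calls
        if isinstance(node, ast.Name):
            if node.id not in ALLOWED_FUNCS | allowed_names:
                return False             # only declared state vars and params
    return True

assert is_whitelisted("-(g/L)*sin(theta)", {"theta", "g", "L"})
assert not is_whitelisted("__import__('os')", {"theta"})
```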

A few additional invariants hold by construction:

  • No LLM-as-judge. Every component reduces to a deterministic function of the simulated trajectory and the parsed AST. There's no rubric the model can social-engineer.
  • Ground truth never surfaces to the agent. The equation and parameters live in PhysiXState for logging only; PhysiXObservation carries trajectory + hint + residual summary, nothing else. The model can't copy the answer back.
  • Per-episode randomised parameters. Memorising a single "answer" doesn't help — g, k, m, b are sampled fresh each reset(), so the model has to learn the form of the equation, not a specific numeric tuple.
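
A sketch of that state/observation split; the class names follow the post, but the exact fields are assumptions:

```python
from dataclasses import dataclass

@dataclass
class PhysiXState:            # server-side only, never shown to the model
    equation: str             # ground-truth ODE, kept for logging
    params: dict              # e.g. {"g": 9.4}; resampled every reset()
    turn: int

@dataclass
class PhysiXObservation:      # everything the agent ever sees
    trajectory: list          # noisy (t, state) samples
    hint: str                 # one-sentence physical hint
    residual_summary: str     # English feedback on the last attempt
    reward_breakdown: dict    # numeric per-component rewards
```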

Training: SFT → GRPO

Why SFT first

Qwen2.5-3B out of the box produces LaTeX, prose, or malformed JSON on ~80% of turns — the verifier can't parse any of it. GRPO needs variance in reward across rollouts to estimate advantages; if every rollout scores ~0 because nothing parses, the gradient is zero and nothing learns.

SFT on synthetic (prompt, ground_truth_equation) pairs teaches the model the output format before RL begins. 4 epochs, ~5 min on L40S. After SFT, >90% of completions parse and simulate successfully — GRPO now has a signal to work with.
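
A sketch of how such pairs can be assembled; sample_episode and render_prompt are hypothetical helpers standing in for the repo's actual data pipeline:

```python
import json

def make_sft_example(system, rng):
    """One synthetic (prompt, completion) pair in the verifier's exact format."""
    t, trajectory, params = system.sample_episode(rng)   # hypothetical helper
    prompt = render_prompt(trajectory, system.hint)      # same template as RL
    completion = json.dumps({"equation": system.ground_truth_equation,
                             "params": params})
    return {"prompt": prompt, "completion": completion}
```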

GRPO

  • Model: Qwen/Qwen2.5-3B-Instruct + LoRA-32
  • Systems: free_fall, simple_pendulum, damped_spring
  • Steps: 200 (early stopping on reward convergence)
  • LR: 1e-5
  • Generations: 4 per prompt
  • Framework: Unsloth + TRL GRPOTrainer
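
Wired together with Unsloth and TRL, the run looks roughly like this; the hyperparameters are the ones listed above, while sequence length, quantisation, target modules, and the dataset/reward wrappers are illustrative assumptions:

```python
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", max_seq_length=4096, load_in_4bit=True)
model = FastLanguageModel.get_peft_model(
    model, r=32, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

args = GRPOConfig(
    learning_rate=1e-5,
    max_steps=200,           # early stopping on reward convergence
    num_generations=4,       # rollouts per prompt for advantage estimation
)

trainer = GRPOTrainer(
    model=model,
    args=args,
    train_dataset=prompt_dataset,   # prompts built from env episodes
    reward_funcs=physix_reward,     # wraps the odeint-based verifier
)
trainer.train()
```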

Results

[Figures: SFT loss (↓) and GRPO loss (↓) curves, plus GRPO total reward (↑) over training.]

Key observations:

  • Total mean reward rises from ~3.3 → ~4.8 (+45%) over 200 steps, with the ±1σ band shrinking
  • The SFT warm-start does its job immediately — format compliance is high from step 1, so GRPO spends its budget improving R² rather than relearning JSON syntax
  • GRPO loss near zero is expected — it's the KL regularisation term; the real signal is the reward curves

Full runs: wandb.ai/pratyush01/physix-live


What's Novel

  1. Verifiable reward without a judge — R² from scipy.odeint is ground truth, not a proxy
  2. Iterative refinement loop — the environment feeds residual summaries back in English so the agent can reason about what went wrong and refine
  3. Reward hacking case study — three exploits found and patched during development: trivial-equation simplicity farming, parse-but-crash equations, and out-of-grammar tokens
  4. SFT → GRPO pipeline — shows how a cold 3B model can be made RL-trainable in under 10 minutes of SFT

Future Scope

The framework is system-agnostic — adding a new physical system means subclassing PhysicalSystem, implementing simulate and ground_truth_equation, and registering it. The verifier, reward function, and training loop need no changes.
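
For example, a coupled oscillator could look roughly like this; the PhysicalSystem base class and method signatures are assumed from the description above:

```python
from scipy.integrate import odeint

class CoupledOscillator(PhysicalSystem):
    # ODE for the first mass; the second is symmetric
    ground_truth_equation = "d2x/dt2 = -(k/m)*x + (kc/m)*(y - x)"

    def simulate(self, params, y0, t):
        k, kc, m = params["k"], params["kc"], params["m"]

        def rhs(state, _t):
            x, vx, y, vy = state
            ax = (-k * x + kc * (y - x)) / m
            ay = (-k * y + kc * (x - y)) / m
            return [vx, ax, vy, ay]

        return odeint(rhs, y0, t)
```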

Natural extensions:

  • More complex dynamics — coupled oscillators, Lorenz attractor, reaction-diffusion, N-body gravity
  • Larger models — the same pipeline runs on 7B; LoRA rank and LR need tuning but the reward design transfers directly
  • Noisier / sparser observations — current trajectories have moderate Gaussian noise; testing on sparser or higher-noise regimes would stress-test the R² reward

Links