PhysiX — Equation Discovery from Noisy Trajectories via RLVR
OpenEnv India Hackathon 2026 · Live Space · Trained Model · W&B Runs
The Problem
Given a short, noisy trajectory of a physical system — positions and velocities over time — can a language model discover the underlying equation of motion?
This is symbolic regression meets agentic RL. The challenge is threefold: the equation space is vast, noise means no trajectory perfectly fits any ODE, and the agent must learn to iterate (propose, simulate, compare residuals, refine). Classical tools (genetic programming, sparse regression) can do this, but they can't read a natural-language hint or reason about failure in English. We train a 3B LLM to do it using RLVR (reinforcement learning with verifiable rewards).
The Environment
PhysiXEnvironment gives the agent a noisy trajectory from a physical system and asks it to output an ODE that reproduces the motion. All reward comes from simulating that ODE with scipy.integrate.odeint, with no LLM-as-judge.
Systems
Three systems were used for training:
| System | Ground-truth ODE |
|---|---|
| Free Fall | d2y/dt2 = -g |
| Simple Pendulum | d2theta/dt2 = -(g/L)*sin(theta) |
| Damped Spring | d2x/dt2 = -(k/m)*x - (b/m)*dx |
Parameters and initial conditions are randomised per episode.
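As a rough illustration of what one randomised episode might look like, the sketch below samples parameters for the damped spring from the table, integrates it with `scipy.integrate.odeint`, and adds Gaussian noise. The function names, parameter ranges, noise level, and sampling grid are all assumptions made for this example, not values taken from the environment.

```python
import numpy as np
from scipy.integrate import odeint


def damped_spring_rhs(state, t, k, m, b):
    """State is [x, dx/dt]; returns [dx/dt, d2x/dt2]."""
    x, v = state
    return [v, -(k / m) * x - (b / m) * v]


def sample_noisy_trajectory(rng, n_points=50, t_end=5.0, noise_std=0.02):
    # Hypothetical ranges: the real environment's ranges are not documented here.
    params = {
        "k": rng.uniform(1.0, 5.0),
        "m": rng.uniform(0.5, 2.0),
        "b": rng.uniform(0.1, 0.5),
    }
    y0 = [rng.uniform(0.5, 1.5), 0.0]  # random initial position, at rest
    t = np.linspace(0.0, t_end, n_points)
    clean = odeint(damped_spring_rhs, y0, t, args=(params["k"], params["m"], params["b"]))
    # Independent Gaussian noise on both position and velocity.
    noisy = clean + rng.normal(0.0, noise_std, size=clean.shape)
    return t, noisy, params


rng = np.random.default_rng(0)
t, traj, params = sample_noisy_trajectory(rng)
print(traj.shape, sorted(params))
```

Passing the random generator in explicitly keeps episodes reproducible from a seed, which is convenient when comparing reward curves across runs.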
Episode flow
- reset() → agent receives a noisy trajectory + a one-sentence physical hint
- Agent proposes an ODE in JSON: {"equation": "...", "params": {...}}
- Environment simulates it via scipy.integrate.odeint and computes R²
- Agent receives a residual summary in English + numeric reward breakdown
- Repeat up to 8 turns
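The propose, simulate, compare step can be sketched roughly as follows. This is a minimal illustration and not the environment's actual verifier: it assumes the equation's right-hand side is a Python expression over the names `x` and `dx` plus the entries of `params`, it scores position only, and it uses `eval` with an emptied builtins table, which is not a real sandbox.

```python
import json

import numpy as np
from scipy.integrate import odeint


def r_squared(observed, predicted):
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def score_proposal(proposal_json, t, observed_x, y0):
    """Parse a JSON proposal, simulate it, and return R² on position."""
    proposal = json.loads(proposal_json)
    rhs_text = proposal["equation"].split("=", 1)[1].strip()
    code = compile(rhs_text, "<proposal>", "eval")
    names = {"sin": np.sin, "cos": np.cos, "exp": np.exp, **proposal["params"]}

    def rhs(state, _t):
        x, dx = state
        # Assumed variable names: x for position, dx for velocity.
        accel = eval(code, {"__builtins__": {}}, {**names, "x": x, "dx": dx})
        return [dx, accel]

    simulated = odeint(rhs, y0, t)
    return r_squared(observed_x, simulated[:, 0])


# Synthetic observation from a damped spring with k/m = 4 and b/m = 0.3.
t = np.linspace(0.0, 5.0, 100)
truth = odeint(lambda s, _t: [s[1], -4.0 * s[0] - 0.3 * s[1]], [1.0, 0.0], t)
observed = truth[:, 0] + np.random.default_rng(0).normal(0.0, 0.02, t.size)

good = '{"equation": "d2x/dt2 = -(k/m)*x - (b/m)*dx", "params": {"k": 4.0, "m": 1.0, "b": 0.3}}'
bad = '{"equation": "d2x/dt2 = -(k/m)*x", "params": {"k": 4.0, "m": 1.0}}'
r2_good = score_proposal(good, t, observed, [1.0, 0.0])
r2_bad = score_proposal(bad, t, observed, [1.0, 0.0])
print(r2_good > r2_bad)
```

Dropping the damping term lowers the score without zeroing it, which is the kind of partial credit a refinement turn can build on.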
Reward Design
All reward is computed from scipy.integrate.odeint, with no model-in-the-loop scoring.
R² (coefficient of determination): R² = 1 is a perfect match, R² = 0 means no better than predicting the mean, R² < 0 is actively wrong.
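For reference, the three cases above follow from the standard definition of R². The symbols are chosen here for illustration and are not taken from the project: y_i are the observed samples, ŷ_i the simulated ones, and ȳ the mean of the observations.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Predicting ŷ_i = ȳ makes the two sums equal, so R² = 0, and residuals larger than that push R² below zero. With additive noise of variance σ², even the true ODE leaves a residual sum of roughly nσ², so its score sits slightly below 1 rather than exactly at it.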
| Component | Formula | What it rewards | Why it's needed |
|---|---|---|---|
| match | R² | Continuous fit quality | Primary learning signal |
| match_dense | √R² | Same, stretched | R² ≈ 0 early on; √R² gives non-zero gradient (√0.05 ≈ 0.22) so GRPO isn't blind in early steps |
| correctness | 1 if R² ≥ 0.70 else 0 | Binary "good enough" | Creates a cliff the policy climbs; helps escape plateaus |
| simplicity | 1 − operators/12, gated on R² ≥ 0.10 | Shorter equations | Without the gate, d2y/dt2 = 0 scores simplicity = 1 for free |
| format | 1 if parses and odeint runs without NaN | Valid, simulatable output | Without the NaN check, explosive equations like exp(vy**10) claim reward |
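The table can be read as a small pure function. The sketch below is one possible reading, with several assumptions flagged: negative R² is floored at 0 before taking the square root, every component is zeroed when the format check fails, and the operator count is passed in because the counting rule is not specified here. The function name and signature are hypothetical.

```python
import math


def reward_breakdown(r2, n_operators, simulated_ok):
    """Compute the five reward components from the table.

    r2           : coefficient of determination of the simulated trajectory
    n_operators  : operator count of the proposed equation (counting rule assumed)
    simulated_ok : True if the proposal parsed and odeint returned no NaN
    """
    if not simulated_ok:
        # Assumption: an unsimulatable proposal earns nothing anywhere.
        return {"match": 0.0, "match_dense": 0.0, "correctness": 0.0,
                "simplicity": 0.0, "format": 0.0}
    clipped = min(max(r2, 0.0), 1.0)  # assumption: negative R² is floored at 0
    gate_open = r2 >= 0.10            # simplicity only counts once the fit is non-trivial
    return {
        "match": clipped,
        "match_dense": math.sqrt(clipped),
        "correctness": 1.0 if r2 >= 0.70 else 0.0,
        "simplicity": max(0.0, 1.0 - n_operators / 12) if gate_open else 0.0,
        "format": 1.0,
    }


early = reward_breakdown(r2=0.05, n_operators=3, simulated_ok=True)
trivial = reward_breakdown(r2=-0.4, n_operators=0, simulated_ok=True)
good = reward_breakdown(r2=0.93, n_operators=4, simulated_ok=True)
print(round(early["match_dense"], 2), trivial["simplicity"], good["correctness"])
```

The `trivial` case corresponds to the d2y/dt2 = 0 exploit: zero operators would score full simplicity, but the gate keeps it at 0 because the fit is poor.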
Training: SFT → GRPO
Why SFT first
Qwen2.5-3B out of the box produces LaTeX, prose, or malformed JSON on ~80% of turns — the verifier can't parse any of it. GRPO needs variance in reward across rollouts to estimate advantages; if every rollout scores ~0 because nothing parses, the gradient is zero and nothing learns.
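The zero-gradient argument can be seen directly in the group-relative advantage that GRPO uses: each rollout's reward is normalised against the mean and standard deviation of its own group. The sketch below shows that normalisation for a group of 4; the function name and the epsilon value are illustrative choices, and the reward numbers are made up.

```python
import numpy as np


def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one group of rollouts for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Before SFT: nothing parses, every rollout scores 0, every advantage is 0.
flat = group_advantages([0.0, 0.0, 0.0, 0.0])

# After SFT: some rollouts parse and fit, so advantages separate good from bad.
mixed = group_advantages([0.0, 0.0, 1.2, 3.9])

print(flat)
print(mixed)
```

When every advantage in a group is zero, the policy-gradient term contributes nothing for that prompt, which is why a format warm-start matters before RL.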
SFT on synthetic (prompt, ground_truth_equation) pairs teaches the model the output format before RL begins. Training runs for 4 epochs and takes ~5 min on an L40S. After SFT, >90% of completions parse and simulate successfully, so GRPO now has a signal to work with.
GRPO
- Model: Qwen/Qwen2.5-3B-Instruct + LoRA-32
- Systems: free_fall, simple_pendulum, damped_spring
- Steps: 200 (early stopping on reward convergence)
- LR: 1e-5
- Generations: 4 per prompt
- Framework: Unsloth + TRL GRPOTrainer
Results
Key observations:
- reward_format jumps to ~0.9 in the first 10 steps: the SFT warm-start does its job immediately
- reward_match_dense (√R²) and reward_correctness climb from ~0.6 → ~0.95 over 200 steps
- reward_match (raw R²) converges to ~0.95+ by step 150
- Total mean reward rises from ~3.3 → ~4.8 (+45%) with the ±1σ band narrowing
- GRPO loss near zero is expected: advantages are normalised within each group, so the policy term averages out to roughly zero and what remains is the KL regularisation term; the real signal is the reward curves
Full runs: wandb.ai/pratyush01/physix-live
What's Novel
- Verifiable reward without a judge: R² from scipy.integrate.odeint is ground truth, not a proxy
- Iterative refinement loop: the environment feeds residual summaries back in English so the agent can reason about what went wrong and refine
- Reward hacking case study: three exploits found and patched during development (parse-but-crash equations, trivial-equation simplicity farming, duplicate progress signal)
- SFT → GRPO pipeline: shows how a cold 3B model can be made RL-trainable in under 10 minutes of SFT
Future Scope
The framework is system-agnostic — adding a new physical system means subclassing PhysicalSystem, implementing simulate and ground_truth_equation, and registering it. The verifier, reward function, and training loop need no changes.
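A new system might look roughly like the sketch below. The base class, method signatures, and registry shown here are stand-ins written for this example; only the names PhysicalSystem, simulate, and ground_truth_equation come from the description above, and the project's real interface may differ.

```python
from abc import ABC, abstractmethod

import numpy as np
from scipy.integrate import odeint

SYSTEMS = {}  # hypothetical registry: system name -> class


def register(name):
    """Hypothetical decorator that adds a system class to the registry."""
    def decorator(cls):
        SYSTEMS[name] = cls
        return cls
    return decorator


class PhysicalSystem(ABC):
    """Stand-in for the project's base class; signatures are assumed."""

    @abstractmethod
    def simulate(self, t, rng):
        """Return a noisy trajectory of shape (len(t), 2)."""

    @abstractmethod
    def ground_truth_equation(self):
        """Return the true ODE as a string."""


@register("harmonic_oscillator")
class HarmonicOscillator(PhysicalSystem):
    def __init__(self, omega=2.0, noise_std=0.02):
        self.omega = omega
        self.noise_std = noise_std

    def simulate(self, t, rng):
        # State is [x, dx/dt]; undamped oscillator with angular frequency omega.
        rhs = lambda state, _t: [state[1], -(self.omega ** 2) * state[0]]
        clean = odeint(rhs, [1.0, 0.0], t)
        return clean + rng.normal(0.0, self.noise_std, size=clean.shape)

    def ground_truth_equation(self):
        return "d2x/dt2 = -(omega**2)*x"


system = SYSTEMS["harmonic_oscillator"]()
traj = system.simulate(np.linspace(0.0, 5.0, 50), np.random.default_rng(0))
print(system.ground_truth_equation(), traj.shape)
```

Because the verifier only needs a trajectory and a simulatable proposal, nothing in this sketch touches reward or training code.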
Natural extensions:
- More complex dynamics — coupled oscillators, Lorenz attractor, reaction-diffusion, N-body gravity
- Larger models — the same pipeline runs on 7B; LoRA rank and LR need tuning but the reward design transfers directly
- Multi-turn curriculum — currently trains on turn-0 only; training on full refinement trajectories would teach the model to use residual feedback more effectively
- Noisier / sparser observations — current trajectories have moderate Gaussian noise; testing on sparser or higher-noise regimes would stress-test the R² reward
Links
- Live demo: https://huggingface.co/spaces/Pratyush-01/physix-live
- Trained model: https://huggingface.co/Pratyush-01/physix-3b-rl
- Training notebook: https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/train/physix_train_colab.ipynb
- W&B project: https://wandb.ai/pratyush01/physix-live



