# PhysiX — Equation Discovery from Noisy Trajectories via RLVR
**OpenEnv India Hackathon 2026** · [Live Space](https://huggingface.co/spaces/Pratyush-01/physix-live) · [Trained Model](https://huggingface.co/Pratyush-01/physix-3b-rl) · [W&B Runs](https://wandb.ai/pratyush01/physix-live)
---
## The Problem
Given a short, noisy trajectory of a physical system — positions and velocities over time — can a language model discover the underlying equation of motion?
This is symbolic regression meets agentic RL. The challenge: the equation space is vast, noise means no trajectory perfectly fits any ODE, and the agent must learn to iterate — propose, simulate, compare residuals, refine. Classical tools (genetic programming, sparse regression) can do this, but they can't read a natural language hint or reason about failure in English. We train a 3B LLM to do it using RLVR.
---
## The Environment
**PhysiXEnvironment** gives the agent a noisy trajectory from a physical system and asks it to output an ODE that reproduces the motion. All reward comes from `scipy.odeint` — no LLM-as-judge.
### Systems
Three systems were used for training:
| System | Ground-truth ODE |
| --------------- | --------------------------------- |
| Free Fall | `d2y/dt2 = -g` |
| Simple Pendulum | `d2theta/dt2 = -(g/L)*sin(theta)` |
| Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` |
Parameters and initial conditions are randomised per episode.
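For instance, a damped-spring episode might be initialised like this (the sampling ranges below are illustrative, not the project's actual bounds):

```python
# Illustrative per-episode randomisation for the damped spring;
# the ranges are assumptions, not the project's actual bounds.
import numpy as np

rng = np.random.default_rng()
params = {
    "k": rng.uniform(1.0, 10.0),   # stiffness
    "m": rng.uniform(0.5, 2.0),    # mass
    "b": rng.uniform(0.1, 1.0),    # damping coefficient
}
x0, v0 = rng.uniform(-1.0, 1.0, size=2)  # initial position and velocity
```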
### Episode flow
1. `reset()` → agent receives a noisy trajectory + a one-sentence physical hint
2. Agent proposes an ODE in JSON: `{"equation": "...", "params": {...}}`
3. Environment simulates it via `scipy.odeint` and computes R²
4. Agent receives a residual summary in English + numeric reward breakdown
5. Repeat up to 8 turns
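In code, an episode looks roughly like this. Only `PhysiXEnvironment`, `reset()`, and the JSON action format are named in this post; the `step()` signature, `render_prompt`, and `policy` are hypothetical stand-ins:

```python
# Hypothetical episode loop; step(), render_prompt, and policy are
# stand-ins, not the project's exact interface.
import json

env = PhysiXEnvironment(system="simple_pendulum")
obs = env.reset()                         # noisy trajectory + one-sentence hint

for turn in range(8):                     # up to 8 turns per episode
    completion = policy.generate(render_prompt(obs))
    action = json.loads(completion)       # {"equation": "...", "params": {...}}
    obs, reward, done = env.step(action)  # residual summary + reward breakdown
    if done:
        break
```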
---
## Reward Design
All reward is computed from `scipy.odeint` — no model-in-the-loop scoring.
**R²** (coefficient of determination), computed as `R² = 1 − SS_res/SS_tot`: R² = 1 is a perfect match, R² = 0 means the fit is no better than predicting the mean, and R² < 0 is actively wrong.
| Component | Formula | What it rewards | Why it's needed |
| ------------- | --------------------------------------------- | ------------------------- | ---------------------------------------------------------------------------------------------- |
| `match` | R² | Continuous fit quality | Primary learning signal |
| `match_dense` | √R² | Same, stretched | R² ≈ 0 early in training; √R² amplifies small values (√0.05 ≈ 0.22), so rollouts still differ in reward and GRPO isn't blind in early steps |
| `correctness` | 1 if R² ≥ 0.70 else 0 | Binary "good enough" | Creates a cliff the policy climbs; helps escape plateaus |
| `simplicity` | 1 − operators/12, gated on R² ≥ 0.10 | Shorter equations | Without the gate, `d2y/dt2 = 0` scores simplicity = 1 for free |
| `format` | 1 if parses **and** `odeint` runs without NaN | Valid, simulatable output | Without the NaN check, explosive equations like `exp(vy**10)` claim reward |
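The verifier math is small enough to sketch end to end. The snippet below is self-contained and mirrors the tables above (damped spring, R² thresholds at 0.70 and 0.10); the noise level, candidate parameters, and operator count are illustrative:

```python
# Self-contained sketch of the reward components; the noise level and the
# candidate's parameters are illustrative.
import numpy as np
from scipy.integrate import odeint

def damped_spring(state, t, k, m, b):
    x, v = state
    return [v, -(k / m) * x - (b / m) * v]

t = np.linspace(0, 10, 200)
truth = odeint(damped_spring, [1.0, 0.0], t, args=(4.0, 1.0, 0.4))
observed = truth + np.random.normal(0.0, 0.02, truth.shape)  # noisy trajectory

# Agent's candidate: right form, slightly wrong stiffness
candidate = odeint(damped_spring, [1.0, 0.0], t, args=(3.6, 1.0, 0.4))

ss_res = np.sum((observed - candidate) ** 2)
ss_tot = np.sum((observed - observed.mean(axis=0)) ** 2)
r2 = 1.0 - ss_res / ss_tot

match = r2
match_dense = np.sqrt(max(r2, 0.0))             # stretched early signal
correctness = float(r2 >= 0.70)                 # binary cliff
n_ops = 5                                       # counted from the parsed AST
simplicity = (1 - n_ops / 12) if r2 >= 0.10 else 0.0
print(f"R2={r2:.3f} dense={match_dense:.3f} correct={correctness}")
```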
---
## Preventing Reward Hacking
Three exploits showed up during early runs and were patched directly in the verifier — each is now an invariant the reward function enforces.
| Exploit | What the model learned | Patch |
| ---------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Trivial-equation farming** | Emit `dx/dt = 0` — parses, simulates, scores `simplicity = 1` for free | `simplicity` is gated on R² ≥ 0.10. No physical fit, no simplicity reward. |
| **Parse-but-crash** | Emit syntactically valid but explosive equations (`exp(vy**10)`) that crash `odeint`; agent farmed `format` reward for "almost runnable" output | `format = 1` requires both **parse success** and **simulation success without NaN/inf**. Crash → all components zero. |
| **Out-of-grammar tokens** | Emit Python expressions outside the DSL (attribute access, lambdas, arbitrary calls) hoping the parser would accept them | Parser is an AST whitelist: only `+ - * / **`, a fixed set of math functions, declared state vars and parameter names. Anything else → `format = 0`. |
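The third patch carries most of the security weight, so it's worth sketching. A minimal version of the AST whitelist (the allowed node and function sets below are assumptions, not the project's exact grammar):

```python
# Minimal AST-whitelist sketch; the allowed node and function sets are
# assumptions, not the project's exact grammar.
import ast

ALLOWED_FUNCS = {"sin", "cos", "exp", "sqrt", "tanh"}
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant, ast.Name,
    ast.Call, ast.Load, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow, ast.USub,
)

def in_grammar(expr: str, names: set[str]) -> bool:
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False                          # lambdas, attributes, etc.
        if isinstance(node, ast.Call) and not (
            isinstance(node.func, ast.Name) and node.func.id in ALLOWED_FUNCS
        ):
            return False                          # only whitelisted calls
        if isinstance(node, ast.Name) and node.id not in ALLOWED_FUNCS | names:
            return False                          # undeclared identifier
    return True

print(in_grammar("-(g/L)*sin(theta)", {"theta", "g", "L"}))  # True
print(in_grammar("np.sin(theta)", {"theta"}))                # False: attribute access
```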
A few additional invariants hold by construction:
- **No LLM-as-judge.** Every component reduces to a deterministic function of the simulated trajectory and the parsed AST. There's no rubric the model can social-engineer.
- **Ground truth never surfaces to the agent.** The equation and parameters live in `PhysiXState` for logging only; `PhysiXObservation` carries trajectory + hint + residual summary, nothing else. The model can't copy the answer back.
- **Per-episode randomised parameters.** Memorising a single "answer" doesn't help — `g`, `k`, `m`, `b` are sampled fresh each `reset()`, so the model has to learn the *form* of the equation, not a specific numeric tuple.
---
## Training: SFT → GRPO
### Why SFT first
Qwen2.5-3B out of the box produces LaTeX, prose, or malformed JSON on ~80% of turns — the verifier can't parse any of it. GRPO needs *variance in reward* across rollouts to estimate advantages; if every rollout scores ~0 because nothing parses, the gradient is zero and nothing learns.
SFT on synthetic `(prompt, ground_truth_equation)` pairs teaches the model the output format before RL begins: 4 epochs, ~5 minutes on an L40S. After SFT, >90% of completions parse and simulate successfully — GRPO now has a signal to work with.
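A single synthetic pair might look like this (the prompt wording is illustrative; the completion is exactly the JSON action format the verifier parses):

```python
# One illustrative SFT pair; the prompt template is an assumption,
# the completion matches the JSON action format above.
import json

sft_example = {
    "prompt": (
        "Hint: a mass on a spring with friction.\n"
        "Trajectory (t, x, dx): [...]\n"
        "Propose the ODE as JSON."
    ),
    "completion": json.dumps({
        "equation": "d2x/dt2 = -(k/m)*x - (b/m)*dx",
        "params": {"k": 4.2, "m": 1.0, "b": 0.35},
    }),
}
```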
### GRPO
- **Model:** Qwen/Qwen2.5-3B-Instruct + LoRA-32
- **Systems:** free_fall, simple_pendulum, damped_spring
- **Steps:** 200 (early stopping on reward convergence)
- **LR:** 1e-5
- **Generations:** 4 per prompt
- **Framework:** Unsloth + TRL GRPOTrainer
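Wired together, the setup looks roughly like this. It's a sketch, not the training notebook: the dataset construction is elided, `score_with_verifier` is a stand-in for the scipy-backed reward, and exact Unsloth/TRL argument names can differ across versions:

```python
# Hedged GRPO setup sketch; dataset construction is elided and
# score_with_verifier is a stand-in for the scipy-backed reward.
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(
    model, r=32, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

def physix_reward(completions, **kwargs):
    # The real function parses each completion, simulates it with
    # scipy.odeint, and returns the summed component rewards.
    return [score_with_verifier(c) for c in completions]

config = GRPOConfig(
    output_dir="physix-grpo",
    learning_rate=1e-5,
    num_generations=4,   # rollouts per prompt
    max_steps=200,
)
trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,   # prompts built from env resets (elided)
    reward_funcs=[physix_reward],
    processing_class=tokenizer,
)
trainer.train()
```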
---
## Results
| SFT Loss (↓) | GRPO Loss (↓) |
|:---:|:---:|
| ![SFT loss](plots/sft_loss.png) | ![GRPO loss](plots/loss.png) |
| GRPO Total Reward (↑) |
|:---:|
| ![reward](plots/reward.png) |
Key observations:
- Total mean reward rises from ~3.3 → ~4.8 (+45%) over 200 steps, with the ±1σ band narrowing
- The SFT warm-start does its job immediately — format compliance is high from step 1, so GRPO spends its budget improving R² rather than relearning JSON syntax
- GRPO loss near zero is **expected** — it's the KL regularisation term; the real signal is in the reward curves
Full runs: [wandb.ai/pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live)
---
## What's Novel
1. **Verifiable reward without a judge** — R² from `scipy.odeint` is ground truth, not a proxy
2. **Iterative refinement loop** — the environment feeds residual summaries back in English so the agent can reason about what went wrong and refine
3. **Reward hacking case study** — three exploits found and patched during development: trivial-equation simplicity farming, parse-but-crash equations, and out-of-grammar tokens
4. **SFT → GRPO pipeline** — shows how a cold 3B model can be made RL-trainable in under 10 minutes of SFT
---
## Future Scope
The framework is system-agnostic — adding a new physical system means subclassing `PhysicalSystem`, implementing `simulate` and `ground_truth_equation`, and registering it. The verifier, reward function, and training loop need no changes.
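As a concrete (hypothetical) example, here is what a Lorenz system could look like; `PhysicalSystem`'s exact interface isn't shown in this post, so the signatures below are inferred from the method names, not verbatim:

```python
# Hedged sketch of a new system; simulate() and ground_truth_equation()
# are the names given above, but their signatures are inferred.
import numpy as np
from scipy.integrate import odeint

class LorenzSystem(PhysicalSystem):
    def ground_truth_equation(self):
        return ("dx/dt = sigma*(y - x); "
                "dy/dt = x*(rho - z) - y; "
                "dz/dt = x*y - beta*z")

    def simulate(self, y0, t, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        def rhs(state, _t):
            x, y, z = state
            return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]
        return odeint(rhs, y0, t)
```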
Natural extensions:
- **More complex dynamics** — coupled oscillators, Lorenz attractor, reaction-diffusion, N-body gravity
- **Larger models** — the same pipeline runs on 7B; LoRA rank and LR need tuning but the reward design transfers directly
- **Noisier / sparser observations** — current trajectories have moderate Gaussian noise; testing on sparser or higher-noise regimes would stress-test the R² reward
---
## Links
- **Live demo:** [https://huggingface.co/spaces/Pratyush-01/physix-live](https://huggingface.co/spaces/Pratyush-01/physix-live)
- **Trained model:** [https://huggingface.co/Pratyush-01/physix-3b-rl](https://huggingface.co/Pratyush-01/physix-3b-rl)
- **Training notebook:** [https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/train/physix_train_colab.ipynb](https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/train/physix_train_colab.ipynb)
- **W&B project:** [https://wandb.ai/pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live)