---
title: PhysiX
emoji: ⚛️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Equation discovery from noisy trajectories (RLVR)
tags:
  - openenv
  - rlvr
  - physics
  - equation-discovery
  - ode
base_path: /web
---
# PhysiX — Equation Discovery via RLVR

An [OpenEnv](https://github.com/openenv-hackathon/openenv) hackathon submission (Apr 2026).

Given a noisy trajectory and a one-sentence hint, a language model iteratively proposes and refines an ODE that reproduces the observed motion. Reward comes entirely from `scipy.integrate.odeint` + per-step R² — no LLM-as-judge.

---

## Links

| | |
|---|---|
| **Live demo (HF Space)** | https://huggingface.co/spaces/Pratyush-01/physix-live |
| **Trained model** | https://huggingface.co/Pratyush-01/physix-3b-rl |
| **Colab training notebook** | https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/train/physix_train_colab.ipynb |
| **W&B training runs** | https://wandb.ai/pratyush01/physix-live |
| **Blog post / writeup** | https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/docs/blog.md |
| **Checkpoint repo** | https://huggingface.co/Pratyush-01/physix-3b-rl-ckpt |

---

## Environment

### Physical Systems — 3 Tiers

6 training systems across 3 difficulty tiers, plus 2 held-out Tier 3 systems.

| Tier | System | Ground-truth equation | Notes |
|------|--------|-----------------------|-------|
| 1 | Free Fall | `d2y/dt2 = -g` | 1 parameter |
| 1 | Free Fall with Drag | `d2y/dt2 = -g + k*vy**2` | nonlinear drag |
| 1 | Simple Pendulum | `d2theta/dt2 = -(g/L)*sin(theta)` | transcendental |
| 2 | Damped Pendulum | `d2theta/dt2 = -(g/L)*sin(theta) - b*dtheta` | 3 parameters |
| 2 | Spring-Mass | `d2x/dt2 = -(k/m)*x` | parameter ratio |
| 2 | Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` | damped oscillation |
| 3 *(held-out)* | Projectile with Drag | 2-D coupled ODE | out-of-distribution |
| 3 *(held-out)* | Charged Particle in B-field | 2-D Lorentz force | cross-product coupling |

Parameters and initial conditions are randomised per episode.
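As a concrete illustration, the free-fall-with-drag row corresponds to an `odeint`-style state derivative like the sketch below. This is not the repo's code (the real system classes live in `physix/systems/tier1.py`); it only shows what one ground-truth equation looks like as a simulable function.

```python
def free_fall_drag_rhs(state, t, g, k):
    """State derivative for d2y/dt2 = -g + k*vy**2, in the (state, t, *params)
    shape that scipy.integrate.odeint expects.

    Illustrative sketch only. On descent vy < 0, so +k*vy**2 opposes gravity
    and the system approaches a terminal velocity.
    """
    y, vy = state
    return [vy, -g + k * vy ** 2]
```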
### Example Task

```
HINT: Object dropped from altitude 58.3 m, mass 1.8 kg, in air.
Air resistance may be non-negligible.

TRAJECTORY (t, y, vy):
t=0.00  y=58.30   vy=  0.00
t=0.50  y=56.89   vy= -5.44
t=1.00  y=51.91   vy= -9.21
t=2.00  y=35.70   vy=-13.88
t=3.00  y=16.11   vy=-16.02
t=5.00  y=-20.42  vy=-16.49   ← terminal velocity visible

STATS: mean_vy=-10.85 std_vy=6.41 min_vy=-16.49
```

Target output:

```json
{
  "equation": "d2y/dt2 = -g + k * vy**2",
  "params": {"g": 9.81, "k": 0.047},
  "rationale": "Quadratic drag; vy² is positive regardless of sign of vy."
}
```

**Grammar:** operators `+ - * / **`, functions `sin cos tan exp log sqrt abs`, declared state variables and parameter names. Anything outside this scores `format = 0`.
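The whitelist idea can be approximated with the stdlib `ast` module. This is a stand-in, not the project's parser (the real verifier uses a SymPy-based grammar in `physix/verifier/parser.py`); it only demonstrates rejecting anything outside the allowed operators, functions, and declared names.

```python
import ast

# Whitelist from the grammar above.
ALLOWED_FUNCS = {"sin", "cos", "tan", "exp", "log", "sqrt", "abs"}
ALLOWED_NODES = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Call, ast.Name,
                 ast.Constant, ast.Load,
                 ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow, ast.USub, ast.UAdd)


def check_rhs(rhs: str, declared: set[str]) -> bool:
    """True iff the ODE right-hand side stays inside the grammar:
    whitelisted operators and functions, declared names only."""
    try:
        tree = ast.parse(rhs, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False            # attributes, comparisons, %, etc.
        if isinstance(node, ast.Call) and not (
                isinstance(node.func, ast.Name) and node.func.id in ALLOWED_FUNCS):
            return False            # only whitelisted function calls
        if isinstance(node, ast.Name) and node.id not in declared | ALLOWED_FUNCS:
            return False            # undeclared symbol => format = 0
    return True
```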
### Episode Flow

```mermaid
sequenceDiagram
    participant Agent
    participant Env as PhysiXEnvironment
    participant Sim as scipy.odeint
    participant Verifier
    Env->>Agent: reset() → trajectory + hint
    loop up to 8 turns
        Agent->>Env: step(equation + params + rationale)
        Env->>Sim: simulate hypothesis from t=0
        Sim-->>Verifier: predicted trajectory
        Verifier-->>Env: r_match + r_progress + r_simplicity + r_format
        Env->>Agent: mismatch summary + reward breakdown + history
        alt r_match > 0.93 or budget exhausted
            Env-->>Agent: done=True
        end
    end
```

After each step the agent receives an English mismatch summary (e.g. *"predicted vy diverges after t=2 s; residual consistently negative"*) alongside the numeric reward breakdown, so it has something to act on in the next turn.
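In code, the flow above reduces to a short driver loop. This is a hypothetical sketch: `env` is any object with async `reset()`/`step()` in the style of the client under "Programmatic use", and `propose` stands in for the LLM call that turns the latest observation into an action.

```python
async def run_episode(env, propose, max_turns: int = 8):
    """Drive one episode against the sequence diagram above.

    Hypothetical driver: `propose(obs)` is a callable producing the next
    equation+params action; the real client API may differ.
    """
    obs = await env.reset()
    for _ in range(max_turns):                  # turn budget from the diagram
        result = await env.step(propose(obs))   # equation + params + rationale
        obs = result.observation                # mismatch summary + rewards
        if result.done:                         # r_match > 0.93 or budget spent
            break
    return obs
```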
---

## Reward

All reward is computed from `scipy.odeint` output — no model-in-the-loop scoring.
### Step-wise (live env + GRPO)

| Component | Weight | Formula | Purpose |
|-----------|:------:|---------|---------|
| `match` | 0.50 | R² (observed vs. predicted) | primary accuracy signal |
| `progress` | 0.20 | `max(0, r_match − r_match_prev)` | per-turn improvement shaping |
| `simplicity` | 0.20 | `1 − (operator_count / 12)` | prefer shorter equations |
| `format` | 0.10 | 1 if parsed **and** simulated successfully | syntactic + numerical validity |
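The `match` component is ordinary R² over the trajectory. A minimal stand-in (the real implementation is `physix/verifier/metrics.py`, which may clamp or weight differently):

```python
def r_squared(observed: list[float], predicted: list[float]) -> float:
    """Coefficient of determination between observed and simulated values.
    Sketch of the per-step R² used as the `match` reward component."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return (1.0 - ss_res / ss_tot) if ss_tot > 0 else 0.0
```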
### GRPO-only additions

Two extra signals are added during training but not used in the live env:

- **`match_dense = sqrt(R²)`** — gives a non-trivial gradient when raw R² is near zero (e.g. `sqrt(0.05) ≈ 0.22`).
- **`correctness` = 1 if R² ≥ 0.70, else 0** — a binary bonus that helps push past R² plateaus where the dense signal flattens.
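Both additions are a few lines on top of the raw R². A sketch, with the clamp as an assumption (R² can go negative for fits worse than the mean predictor, and `sqrt` needs a non-negative input):

```python
import math


def grpo_extra_rewards(r2: float) -> dict[str, float]:
    """GRPO-only shaping terms; thresholds from the bullets above.
    Illustrative sketch, not the repo's reward_fns.py."""
    r2 = max(0.0, r2)  # assumed clamp: R² < 0 means "worse than the mean"
    return {
        "match_dense": math.sqrt(r2),            # lifts near-zero R² gradients
        "correctness": 1.0 if r2 >= 0.70 else 0.0,
    }
```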
### Reward-hacking mitigations

Three failure modes found during development, and how they were closed:

**1. Parse-but-crash exploit.**
A valid-but-explosive equation (e.g. `d2y/dt2 = exp(vy**10)`) parses but makes `odeint` produce NaN. Without a fix, it still earns `format = 1`.
→ `format = 1` only if integration completes without NaN/inf.

**2. Trivial-equation exploit.**
`d2y/dt2 = 0` has zero operators, so `simplicity = 1`, earning 20% of the step reward for a completely wrong trajectory.
→ `simplicity = 0` unless `r_match ≥ 0.10`.

**3. Progress signal in single-turn GRPO.**
Every GRPO training row starts with `previous_r_match = 0`, so `progress = r_match` — a redundant copy of the match signal that dilutes advantage estimates.
→ `progress` is excluded from the GRPO reward function set; it is only used in multi-turn live episodes.
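Putting the weight table and the three gates together gives a composition along these lines. This is an illustrative sketch, not `physix/verifier/reward.py` itself; function and argument names are assumptions.

```python
import math

# Weights from the step-wise reward table.
WEIGHTS = {"match": 0.50, "progress": 0.20, "simplicity": 0.20, "format": 0.10}


def step_reward(predicted: list[float], r_match: float, r_match_prev: float,
                operator_count: int, multi_turn: bool) -> float:
    """Weighted step reward with the three anti-hacking gates (sketch)."""
    # Gate 1: format pays out only if the simulated trajectory stayed finite.
    r_format = 1.0 if all(math.isfinite(v) for v in predicted) else 0.0
    # Gate 2: simplicity is withheld until the fit is minimally plausible.
    r_simplicity = (1.0 - operator_count / 12) if r_match >= 0.10 else 0.0
    # Gate 3: progress exists only in multi-turn live episodes, not in GRPO.
    r_progress = max(0.0, r_match - r_match_prev) if multi_turn else 0.0
    return (WEIGHTS["match"] * max(0.0, r_match)
            + WEIGHTS["progress"] * r_progress
            + WEIGHTS["simplicity"] * r_simplicity
            + WEIGHTS["format"] * r_format)
```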
---

## Training: SFT → GRPO

### Why SFT first

GRPO relies on reward variance across rollouts to estimate advantages. With a cold base model, ~80% of completions are unparseable (LaTeX, prose, malformed JSON) and most parseable ones crash the integrator, leaving near-zero variance and no useful gradient. The model needs to produce the right output format before RL can do anything meaningful with the physics signal.

SFT runs for 3 epochs on synthetic `(prompt, ground_truth_equation)` pairs generated from the environment. After SFT:

- >90% of completions parse and simulate successfully (up from ~20%).
- Equations are in the ASCII ODE grammar the verifier expects.
- The model has seen the right equation family for each system at least once.

SFT only establishes format. Parameter values are still wrong — that is what GRPO refines.

### Step 1 — SFT warm-start

```bash
python -m physix.training.sft \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-sft \
  --epochs 3 \
  --lora-r 32 \
  --instances-per-system 32 \
  --system-ids damped_spring
```

Runtime: ~5 min on an L40S.

### Step 2 — GRPO

```bash
python -m physix.training.loop \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-rl \
  --num-steps 200 \
  --num-generations 4 \
  --lora-r 32 \
  --sft-checkpoint runs/physix-3b-sft/merged \
  --system-ids damped_spring \
  --push-to-hub \
  --hub-repo-id Pratyush-01/physix-3b-rl
```

Runtime: ~45 min on an L40S.

### Full cloud job

```bash
hf jobs uv run train/job_train_single.py \
  --image unsloth/unsloth:2026.3.8-pt2.9.0-vllm-0.16.0-cu12.8-studio-release \
  --flavor l40sx1 \
  --secrets HF_TOKEN \
  --secrets WANDB_API_KEY \
  -v hf://datasets/Pratyush-01/physix-live-src:/physix-live \
  --timeout 2h
```
---

## Training Results

| GRPO Loss (↓) | Total Reward (↑) |
|:---:|:---:|
|  |  |

| Per-component reward breakdown |
|:---:|
|  |

W&B runs: [pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live)

Key observations from the run:

- `reward/format` jumps to ~0.9 in the first 10 steps and holds there — the SFT warm-start does its job.
- `reward/match_dense` (√R²) and `reward/correctness` both trend from ~0.6 to ~0.9–1.0 over 200 steps — the physics simulation is driving the gradient.
- `reward/match` (raw R²) tracks `match_dense` and converges to ~0.95+ by step 150, indicating the model reliably proposes equations that fit the trajectory.
- `reward/simplicity` rises gradually; the gate at R² ≥ 0.10 prevents trivial-equation gaming (`d2y/dt2 = 0` can't earn simplicity reward).
- Total mean reward rises from ~3.3 to ~4.8 (+45%) with ±1σ variance shrinking — the policy is both improving and becoming more consistent.

What we don't claim:

- The model generalises well to Tier 3 systems without further training.
- A 3B model is competitive with frontier models on this task.
- The reward improvements translate to meaningful physics understanding beyond the training distribution.
---

## Repository Layout

```
physix-live/
├── physix/
│   ├── models.py             # Pydantic Action / Observation / State
│   ├── client.py             # OpenEnv WebSocket client
│   ├── systems/              # physical systems
│   │   ├── base.py           # PhysicalSystem ABC
│   │   ├── tier1.py          # FreeFall, FreeFallWithDrag, SimplePendulum
│   │   ├── tier2.py          # DampedPendulum, SpringMass, DampedSpring
│   │   ├── tier3.py          # ProjectileWithDrag, ChargedInBField (held out)
│   │   └── registry.py
│   ├── verifier/
│   │   ├── parser.py         # SymPy whitelisted grammar
│   │   ├── simulator.py      # scipy.odeint forward simulation
│   │   ├── metrics.py        # per-step R²
│   │   ├── mismatch.py       # English residual summary
│   │   └── reward.py         # reward composition + hacking mitigations
│   ├── server/
│   │   ├── environment.py    # PhysiXEnvironment (OpenEnv subclass)
│   │   ├── interactive.py    # session-based REST router for the UI
│   │   └── app.py
│   └── training/
│       ├── prompt.py         # observation → prompt
│       ├── scorer.py         # cached single-completion scorer
│       ├── reward_fns.py     # TRL-compatible reward callables
│       ├── dataset.py        # GRPO dataset builder
│       ├── sft.py            # SFT warm-start
│       └── loop.py           # Unsloth + TRL GRPO loop
├── frontend/                 # React + TS + Tailwind demo UI
├── train/                    # HF Jobs launcher + Colab notebook
│   ├── submit.py             # submit job via HfApi.run_uv_job
│   ├── job_train.py          # multi-system driver (in-container)
│   ├── job_train_single.py   # single-system driver (in-container)
│   ├── physix_train_colab.ipynb  # SFT → GRPO end-to-end notebook
│   └── sync-plots.sh         # mirror plots from model repo
├── tests/                    # ~30 tests
├── docs/
│   ├── plots/                # committed loss / reward / per-component PNGs
│   └── writeup.md
├── Dockerfile                # env Space build (FastAPI + built React UI)
├── openenv.yaml              # OpenEnv manifest (name, runtime, app entrypoint)
└── pyproject.toml
```
---

## Quick Start

One command from the repo root:

```bash
make dev
```

This starts the FastAPI backend on `:8000` (deps auto-resolved by `uv`) and the Vite frontend on `:5173`. Open [http://localhost:5173](http://localhost:5173).

### Connecting an LLM

The demo speaks to **any OpenAI-compatible `/v1/chat/completions` endpoint** — local Ollama, Hugging Face Inference Providers, OpenAI, vLLM, OpenRouter, etc. The "Connect an LLM" panel exposes:

| Field | Purpose |
|-------|---------|
| **Endpoint** | Preset dropdown. Picks `base_url` + a default model id. |
| **Model** | Provider-native id (HF repo, Ollama tag, OpenAI name). Free-form. |
| **Custom base URL** | Shown when `Custom` is selected. Anything ending in `/v1`. |
| **API key** | Bearer token. Persisted per `base_url` in `localStorage`, never sent unless an episode runs. |

Server-side env-var fallback (lets a deployed Space ship a sensible default without leaking secrets in the bundle):

| URL family | Env var |
|---|---|
| `*huggingface*` | `HF_TOKEN`, then `HUGGINGFACE_API_KEY` |
| `*openai.com*` | `OPENAI_API_KEY` |
| `*openrouter*` | `OPENROUTER_API_KEY` |
| `localhost` / `127.0.0.1` | none (Ollama needs no key) |
### Side-by-side comparison

The default page is a **two-column comparison**: same trajectory, same hint, same seed, same verifier — two different models. The presets are wired to make the headline story self-evident:

- **A** = `Pratyush-01/physix-3b-rl` via HF Inference Providers (the GRPO-trained model)
- **B** = `Qwen/Qwen2.5-3B-Instruct` via HF Inference Providers (untrained baseline)

Drop in `gpt-4o-mini` on either side as a frontier reference, or swap to local Ollama for offline dev. The reward delta between the two columns is exactly what GRPO bought — no benchmark prose necessary.

> **For the trained model on HF Inference Providers**: weights are public, but the repo card needs `inference: true` and a serving provider (Featherless/Together/etc.) to have it loaded. If a visitor sees a 404 from the trained side, they can either bring up `ollama serve` locally and pull a quantised version, or fall back to `Qwen/Qwen2.5-3B-Instruct` on both sides.

### Programmatic use

```python
import asyncio

from physix import PhysiXAction, PhysiXEnv


async def main():
    async with PhysiXEnv(base_url="http://127.0.0.1:8000") as env:
        obs = await env.reset(system_id="free_fall_drag", seed=42)
        result = await env.step(
            PhysiXAction(equation="d2y/dt2 = -g + k * vy**2", params={"g": 9.81, "k": 0.05})
        )
        print(result.observation.reward_breakdown)


asyncio.run(main())
```

Run the test suite with:

```bash
pytest tests/
```

---

## License

MIT.