---
title: PhysiX
emoji: ⚛️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Equation discovery from noisy trajectories (RLVR)
tags:
- openenv
- rlvr
- physics
- equation-discovery
- ode
---

# PhysiX — Equation Discovery via RLVR
An OpenEnv hackathon submission (Apr 2026).
Given a noisy trajectory and a one-sentence hint, a language model iteratively proposes and refines an ODE that reproduces the observed motion. Reward comes entirely from `scipy.integrate.odeint` + per-step R² — no LLM-as-judge.
## Links

| Resource | URL |
|---|---|
| Live demo (HF Space) | https://huggingface.co/spaces/Pratyush-01/physix-live |
| Trained model | https://huggingface.co/Pratyush-01/physix-3b-rl |
| Colab training notebook | https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/train/physix_train_colab.ipynb |
| W&B training runs | https://wandb.ai/pratyush01/physix-live |
| Blog post / writeup | https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/docs/blog.md |
| Checkpoint repo | https://huggingface.co/Pratyush-01/physix-3b-rl-ckpt |
## Environment

### Physical Systems

Three physical systems, each with randomised parameters and initial conditions per episode. These are the systems the published model is trained and evaluated on.
| System | Ground-truth equation | Notes |
|---|---|---|
| Free Fall | `d2y/dt2 = -g` | 1 parameter |
| Simple Pendulum | `d2theta/dt2 = -(g/L)*sin(theta)` | transcendental |
| Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` | damped oscillation |
The verifier and DSL are designed to extend cleanly to richer dynamics (coupled oscillators, Lorenz, reaction–diffusion, N-body). Adding a system is a matter of writing a `PhysicalSystem` subclass and registering it — no changes to the reward, parser, or training loop, as sketched below. See `docs/blog.md` → Future Scope for the extensibility plan.
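For illustration, a new system might look roughly like this (a hypothetical sketch: the method names and the `register_system` helper are assumptions, not the actual `base.py`/`registry.py` API):

```python
# Hypothetical sketch of a new system; the real PhysicalSystem ABC in
# physix/systems/base.py may expose a different interface.
import numpy as np

from physix.systems.base import PhysicalSystem
from physix.systems.registry import register_system  # assumed helper name


class CoupledOscillators(PhysicalSystem):
    """Two equal masses on three springs (one of the planned extensions)."""

    system_id = "coupled_oscillators"
    state_vars = ("x1", "dx1", "x2", "dx2")
    param_names = ("k", "m")

    def sample_params(self, rng: np.random.Generator) -> dict:
        # Randomised per episode, like the three shipped systems.
        return {"k": rng.uniform(2.0, 8.0), "m": rng.uniform(0.5, 2.0)}

    def rhs(self, state, t, params):
        # Right-hand side handed to scipy.odeint by the simulator.
        x1, dx1, x2, dx2 = state
        k, m = params["k"], params["m"]
        return [dx1, -(k / m) * (2 * x1 - x2), dx2, -(k / m) * (2 * x2 - x1)]


register_system(CoupledOscillators)  # assumed registration entry point
```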
### Example Task

```text
HINT: Mass on a spring, displaced 0.40 m and released.
      Visible amplitude decay over a few seconds.

TRAJECTORY (t, x, dx):
t=0.00  x= 0.40  dx= 0.00
t=0.50  x= 0.27  dx=-0.69
t=1.00  x=-0.06  dx=-0.84
t=1.50  x=-0.30  dx=-0.32
t=2.00  x=-0.30  dx= 0.34
t=3.00  x= 0.10  dx= 0.41
t=5.00  x=-0.04  dx=-0.10   ← amplitude visibly shrinking

STATS: x_range=[-0.31, 0.40]  decay_envelope≈e^(-0.18 t)
```
Target output:

```json
{
  "equation": "d2x/dt2 = -(k/m) * x - (b/m) * dx",
  "params": {"k": 4.0, "m": 1.0, "b": 0.36},
  "rationale": "Linear restoring force plus velocity-proportional damping; envelope decay matches b/(2m)."
}
```
Grammar: operators `+ - * / **`, functions `sin cos tan exp log sqrt abs`, plus declared state variables and parameter names. Anything outside this scores `format = 0`.
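A minimal sketch of how such a whitelist can be enforced with SymPy (the real parser lives in `physix/verifier/parser.py`; the names and structure here are illustrative):

```python
# Illustrative whitelisted-grammar check; not the actual parser.py code.
import sympy
from sympy.parsing.sympy_parser import parse_expr

ALLOWED_FUNCS = {"sin", "cos", "tan", "exp", "log", "sqrt", "abs", "Abs"}


def parse_rhs(equation: str, state_vars: list[str], params: list[str]):
    """Parse 'd2x/dt2 = <rhs>'; return None (format = 0) if outside the grammar."""
    _, rhs = equation.split("=", 1)
    names = {n: sympy.Symbol(n) for n in state_vars + params}
    expr = parse_expr(rhs, local_dict=names)
    # Reject undeclared symbols (typos, unknown parameters).
    if not expr.free_symbols <= set(names.values()):
        return None
    # Reject any function outside the whitelist.
    for f in expr.atoms(sympy.Function):
        if type(f).__name__ not in ALLOWED_FUNCS:
            return None
    return expr
```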
### Episode Flow

```mermaid
sequenceDiagram
    participant Agent
    participant Env as PhysiXEnvironment
    participant Sim as scipy.odeint
    participant Verifier
    Env->>Agent: reset() → trajectory + hint
    loop up to 8 turns
        Agent->>Env: step(equation + params + rationale)
        Env->>Sim: simulate hypothesis from t=0
        Sim-->>Verifier: predicted trajectory
        Verifier-->>Env: r_match + r_progress + r_simplicity + r_format
        Env->>Agent: mismatch summary + reward breakdown + history
        alt r_match > 0.93 or budget exhausted
            Env-->>Agent: done=True
        end
    end
```
After each step the agent receives an English mismatch summary (e.g. "predicted vy diverges after t=2 s; residual consistently negative") alongside the numeric reward breakdown, so it has something to act on in the next turn.
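From the agent's side, an episode reduces to a propose, observe, refine loop. A minimal sketch, assuming the client API shown under Programmatic use below and a `result.done` flag on the step result:

```python
# Illustrative multi-turn driver; propose_equation is a stand-in for your LLM call.
from physix import PhysiXEnv, PhysiXAction


async def run_episode(propose_equation, base_url: str = "http://127.0.0.1:8000"):
    async with PhysiXEnv(base_url=base_url) as env:
        obs = await env.reset(system_id="damped_spring", seed=42)
        for _ in range(8):  # turn budget
            action: PhysiXAction = propose_equation(obs)  # LLM reads the mismatch summary
            result = await env.step(action)
            obs = result.observation  # mismatch summary + reward breakdown + history
            if result.done:  # r_match > 0.93 or budget exhausted (assumed flag)
                break
        return obs
```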
## Reward

All reward is computed from `scipy.odeint` output — no model-in-the-loop scoring.

### Step-wise (live env + GRPO)
| Component | Weight | Formula | Purpose |
|---|---|---|---|
| `match` | 0.50 | R² (observed vs. predicted) | primary accuracy signal |
| `progress` | 0.20 | `max(0, r_match − r_match_prev)` | per-turn improvement shaping |
| `simplicity` | 0.20 | `1 − (operator_count / 12)` | prefer shorter equations |
| `format` | 0.10 | 1 if parsed and simulated successfully | syntactic + numerical validity |
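In code, the live multi-turn reward is a weighted sum of the four components. A minimal sketch, including the gates described under Reward-hacking mitigations below (the real composition lives in `physix/verifier/reward.py`):

```python
# Illustrative sketch of the step-wise reward; see physix/verifier/reward.py.
def step_reward(r2: float, r2_prev: float, op_count: int, sim_ok: bool) -> float:
    r_format = 1.0 if sim_ok else 0.0             # parsed + integrated, no NaN/inf
    r_match = max(0.0, r2) if sim_ok else 0.0     # per-step R²
    r_progress = max(0.0, r_match - r2_prev)      # per-turn improvement
    r_simplicity = max(0.0, 1.0 - op_count / 12)  # prefer shorter equations
    if r_match < 0.10:                            # trivial-equation gate (see below)
        r_simplicity = 0.0
    return 0.50 * r_match + 0.20 * r_progress + 0.20 * r_simplicity + 0.10 * r_format
```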
### GRPO-only additions

Two extra signals are added during training but not used in the live env:

- `match_dense = sqrt(R²)` — gives a non-trivial gradient when raw R² is near zero (e.g. `sqrt(0.05) ≈ 0.22`).
- `correctness` — 1 if R² ≥ 0.70, else 0; a binary bonus that helps push past R² plateaus where the dense signal flattens.
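As reward callables these are a couple of lines each; a sketch (the TRL-compatible versions live in `physix/training/reward_fns.py`):

```python
import math

# Sketches of the two GRPO-only signals; not the exact reward_fns.py code.
def match_dense(r2: float) -> float:
    return math.sqrt(max(0.0, r2))  # sqrt(0.05) ≈ 0.22: gradient even near R² = 0


def correctness(r2: float) -> float:
    return 1.0 if r2 >= 0.70 else 0.0  # binary bonus past the dense-signal plateau
```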
### Reward-hacking mitigations

Three failure modes found during development, and how each was closed:

1. **Parse-but-crash exploit.** A valid-but-explosive equation (e.g. `d2y/dt2 = exp(vy**10)`) parses but makes `odeint` produce NaN. Without a fix, it earns `format = 1`.
   → `format = 1` only if integration completes without NaN/inf (sketched below).
2. **Trivial-equation exploit.** `d2y/dt2 = 0` has zero operators, so `simplicity = 1`, earning 20% of the reward for a completely wrong trajectory.
   → `simplicity = 0` unless `r_match ≥ 0.10`.
3. **Progress signal in single-turn GRPO.** Every GRPO training row starts with `previous_r_match = 0`, so `progress = r_match` — a redundant copy of the match signal that dilutes advantage estimates.
   → `progress` is excluded from the GRPO reward function set; it is only used in multi-turn live episodes.
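The first mitigation reduces to a finiteness check on the forward simulation. A minimal sketch, assuming the `odeint`-based simulator described above (the real check lives in `physix/verifier/simulator.py`):

```python
import numpy as np
from scipy.integrate import odeint

# Sketch of the parse-but-crash guard: format = 1 only if integration
# completes and every predicted value is finite.
def simulate_ok(rhs, y0, t) -> bool:
    with np.errstate(all="ignore"):  # overflow should yield inf, not raise
        pred = odeint(rhs, y0, t)    # rhs(y, t), per the odeint convention
    return bool(np.isfinite(pred).all())
```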
## Training: SFT → GRPO

### Why SFT first
GRPO relies on reward variance across rollouts to estimate advantages. With a cold base model, ~80% of completions are unparseable (LaTeX, prose, malformed JSON) and most parseable ones crash the integrator, leaving near-zero variance and no useful gradient. The model needs to produce the right output format before RL can do anything meaningful with the physics signal.
SFT runs for 3 epochs on synthetic `(prompt, ground_truth_equation)` pairs generated from the environment. After SFT:

- 90% of completions parse and simulate successfully (up from ~20%).
- Equations are in the ASCII ODE grammar the verifier expects.
- The model has seen the right equation family for each system at least once.

SFT only establishes format. Parameter values are still wrong — that is what GRPO refines.
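A sketch of how one synthetic pair can be generated (illustrative only: `render_prompt` and the other helpers are assumed names, not the actual `sft.py` API):

```python
import json

# Hypothetical SFT row builder; the real generation lives in
# physix/training/sft.py and physix/training/dataset.py.
def make_sft_row(system, rng) -> dict:
    params = system.sample_params(rng)          # randomised instance
    prompt = system.render_prompt(params, rng)  # hint + noisy trajectory (assumed helper)
    completion = json.dumps({
        "equation": system.ground_truth_equation,  # assumed attribute
        "params": params,
        "rationale": system.rationale_template,    # assumed attribute
    })
    return {"prompt": prompt, "completion": completion}
```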
### Step 1 — SFT warm-start

```bash
python -m physix.training.sft \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-sft \
  --epochs 3 \
  --lora-r 32 \
  --instances-per-system 32 \
  --system-ids damped_spring
```

Runtime: ~5 min on L40S.
### Step 2 — GRPO

```bash
python -m physix.training.loop \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-rl \
  --num-steps 200 \
  --num-generations 4 \
  --lora-r 32 \
  --sft-checkpoint runs/physix-3b-sft/merged \
  --system-ids damped_spring \
  --push-to-hub \
  --hub-repo-id Pratyush-01/physix-3b-rl
```

Runtime: ~45 min on L40S.
### Full cloud job

```bash
hf jobs uv run train/job_train_single.py \
  --image unsloth/unsloth:2026.3.8-pt2.9.0-vllm-0.16.0-cu12.8-studio-release \
  --flavor l40sx1 \
  --secrets HF_TOKEN \
  --secrets WANDB_API_KEY \
  -v hf://datasets/Pratyush-01/physix-live-src:/physix-live \
  --timeout 2h
```
## Training Results

W&B runs: [pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live)
Key observations from the run:
- Total mean reward rises from ~3.3 to ~4.8 (+45%) over 200 steps with ±1σ variance shrinking — the policy is both improving and becoming more consistent.
- The SFT warm-start gets format compliance high from step 1, so GRPO spends its budget improving R² rather than relearning JSON syntax.
## Repository Layout

```text
physix-live/
├── physix/
│   ├── models.py                 # Pydantic Action / Observation / State
│   ├── client.py                 # OpenEnv WebSocket client
│   ├── systems/                  # physical systems (3 trained, exposed via SUPPORTED_SYSTEMS)
│   │   ├── base.py               # PhysicalSystem ABC
│   │   ├── tier1.py              # FreeFall, SimplePendulum (+ extras for future work)
│   │   ├── tier2.py              # DampedSpring (+ extras for future work)
│   │   ├── tier3.py              # placeholders for future extensions, not exposed
│   │   └── registry.py
│   ├── verifier/
│   │   ├── parser.py             # SymPy whitelisted grammar
│   │   ├── simulator.py          # scipy.odeint forward simulation
│   │   ├── metrics.py            # per-step R²
│   │   ├── mismatch.py           # English residual summary
│   │   └── reward.py             # reward composition + hacking mitigations
│   ├── server/
│   │   ├── environment.py        # PhysiXEnvironment (OpenEnv subclass)
│   │   ├── interactive.py        # session-based REST router for the UI
│   │   └── app.py
│   └── training/
│       ├── prompt.py             # observation → prompt
│       ├── scorer.py             # cached single-completion scorer
│       ├── reward_fns.py         # TRL-compatible reward callables
│       ├── dataset.py            # GRPO dataset builder
│       ├── sft.py                # SFT warm-start
│       └── loop.py               # Unsloth + TRL GRPO loop
├── frontend/                     # React + TS + Tailwind demo UI
├── train/                        # HF Jobs launcher + Colab notebook
│   ├── submit.py                 # submit job via HfApi.run_uv_job
│   ├── job_train.py              # multi-system driver (in-container)
│   ├── job_train_single.py       # single-system driver (in-container)
│   ├── physix_train_colab.ipynb  # SFT → GRPO end-to-end notebook
│   └── sync-plots.sh             # mirror plots from model repo
├── tests/                        # ~30 tests
├── docs/
│   ├── plots/                    # committed loss / reward / per-component PNGs
│   └── writeup.md
├── Dockerfile                    # env Space build (FastAPI + built React UI)
├── openenv.yaml                  # OpenEnv manifest (name, runtime, app entrypoint)
└── pyproject.toml
```
## Quick Start

One command from the repo root:

```bash
make dev
```

This starts the FastAPI backend on `:8000` (deps auto-resolved by `uv`) and the Vite frontend on `:5173`. Open http://localhost:5173.
## Connecting an LLM

The demo speaks to any OpenAI-compatible `/v1/chat/completions` endpoint — local Ollama, Hugging Face Inference Providers, OpenAI, vLLM, OpenRouter, etc. The "Connect an LLM" panel exposes:
| Field | Purpose |
|---|---|
| Endpoint | Preset dropdown. Picks `base_url` + a default model id. |
| Model | Provider-native id (HF repo, Ollama tag, OpenAI name). Free-form. |
| Custom base URL | Shown when Custom is selected. Anything ending in `/v1`. |
| API key | Bearer token. Persisted per `base_url` in `localStorage`, never sent unless an episode runs. |
Server-side env-var fallback (lets a deployed Space ship a sensible default without leaking secrets in the bundle):

| URL family | Env var |
|---|---|
| `*huggingface*` | `HF_TOKEN`, then `HUGGINGFACE_API_KEY` |
| `*openai.com*` | `OPENAI_API_KEY` |
| `*openrouter*` | `OPENROUTER_API_KEY` |
| `localhost` / `127.0.0.1` | none (Ollama needs no key) |
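A sketch of the resolution order (illustrative; the actual FastAPI helper may be named and structured differently):

```python
import os

# Client-supplied key wins; otherwise fall back by URL family.
def resolve_api_key(base_url: str, client_key: str | None) -> str | None:
    if client_key:
        return client_key
    if "huggingface" in base_url:
        return os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")
    if "openai.com" in base_url:
        return os.getenv("OPENAI_API_KEY")
    if "openrouter" in base_url:
        return os.getenv("OPENROUTER_API_KEY")
    return None  # localhost / 127.0.0.1: Ollama needs no key
```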
### Side-by-side comparison

The default page is a two-column comparison: same trajectory, same hint, same seed, same verifier — two different models. The presets are wired to make the headline story self-evident:

- **A** = `Pratyush-01/physix-3b-rl` via HF Inference Providers (the GRPO-trained model)
- **B** = `Qwen/Qwen2.5-3B-Instruct` via HF Inference Providers (untrained baseline)

Drop in `gpt-4o-mini` on either side as a frontier reference, or swap to local Ollama for offline dev. The reward delta between the two columns is exactly what GRPO bought — no benchmark prose necessary.
For the trained model on HF Inference Providers: the weights are public, but the repo card needs `inference: true` and a serving provider (Featherless/Together/etc.) to have it loaded. If a visitor sees a 404 on the trained side, they can either bring up `ollama serve` locally and pull a quantised version, or fall back to `Qwen/Qwen2.5-3B-Instruct` on both sides.
## Programmatic use

```python
import asyncio

from physix import PhysiXEnv, PhysiXAction


async def main():
    async with PhysiXEnv(base_url="http://127.0.0.1:8000") as env:
        obs = await env.reset(system_id="damped_spring", seed=42)
        result = await env.step(
            PhysiXAction(
                equation="d2x/dt2 = -(k/m) * x - (b/m) * dx",
                params={"k": 4.0, "m": 1.0, "b": 0.36},
            )
        )
        print(result.observation.reward_breakdown)


asyncio.run(main())
```
Run the test suite:

```bash
pytest tests/
```
## License

MIT.

