---
title: PhysiX
emoji: ⚛️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Equation discovery from noisy trajectories (RLVR)
tags:
  - openenv
  - rlvr
  - physics
  - equation-discovery
  - ode
---

PhysiX — Equation Discovery via RLVR

An OpenEnv hackathon submission (Apr 2026).

Given a noisy trajectory and a one-sentence hint, a language model iteratively proposes and refines an ODE that reproduces the observed motion. Reward comes entirely from scipy.integrate.odeint + per-step R² — no LLM-as-judge.


Links


Environment

Physical Systems

Three physical systems, each with randomised parameters and initial conditions per episode. These are the systems the published model is trained and evaluated on.

| System | Ground-truth equation | Notes |
|---|---|---|
| Free Fall | `d2y/dt2 = -g` | 1 parameter |
| Simple Pendulum | `d2theta/dt2 = -(g/L)*sin(theta)` | transcendental |
| Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` | damped oscillation |

The verifier and DSL are designed to extend cleanly to richer dynamics (coupled oscillators, Lorenz, reaction–diffusion, N-body). Adding a system is a matter of writing a PhysicalSystem subclass and registering it — no changes to the reward, parser, or training loop. See docs/writeup.md → Future Scope for the extensibility plan.
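
A rough sketch of what that extension looks like in code. The module paths mirror the repository layout (physix/systems/base.py, physix/systems/registry.py), but the attribute names, method signatures, and the register_system helper are illustrative assumptions rather than the published API:

# Hypothetical fourth system; everything below is a sketch, not the shipped interface.
from physix.systems.base import PhysicalSystem        # ABC from the repo layout
from physix.systems.registry import register_system   # assumed registry helper

class CoupledOscillators(PhysicalSystem):
    """Two masses joined by a coupling spring, one of the 'future scope' systems."""
    system_id = "coupled_oscillators"
    state_vars = ["x1", "dx1", "x2", "dx2"]
    param_names = ["k", "kc", "m"]

    def ground_truth(self) -> list[str]:
        # Written in the same ASCII ODE grammar the verifier already parses.
        return [
            "d2x1/dt2 = -(k/m)*x1 + (kc/m)*(x2 - x1)",
            "d2x2/dt2 = -(k/m)*x2 - (kc/m)*(x2 - x1)",
        ]

    def sample_params(self, rng) -> dict[str, float]:
        # Randomised per episode, like the three trained systems.
        return {"k": rng.uniform(2.0, 6.0), "kc": rng.uniform(0.5, 2.0), "m": 1.0}

register_system(CoupledOscillators)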

Example Task

HINT: Mass on a spring, displaced 0.40 m and released.
      Visible amplitude decay over a few seconds.

TRAJECTORY (t, x, dx):
  t=0.00  x= 0.40  dx= 0.00
  t=0.50  x= 0.27  dx=-0.69
  t=1.00  x=-0.06  dx=-0.84
  t=1.50  x=-0.30  dx=-0.32
  t=2.00  x=-0.30  dx= 0.34
  t=3.00  x= 0.10  dx= 0.41
  t=5.00  x=-0.04  dx=-0.10   ← amplitude visibly shrinking

STATS: x_range=[-0.31, 0.40]  decay_envelope≈e^(-0.18 t)

Target output:

{
  "equation": "d2x/dt2 = -(k/m) * x - (b/m) * dx",
  "params": {"k": 4.0, "m": 1.0, "b": 0.36},
  "rationale": "Linear restoring force plus velocity-proportional damping; envelope decay matches b/(2m)."
}

Grammar: operators + - * / **, functions sin cos tan exp log sqrt abs, declared state variables and parameter names. Anything outside this scores format = 0.
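
A minimal sketch of how a whitelist like this can be enforced with SymPy. The real parser lives in physix/verifier/parser.py; the function below is an illustration with assumed names, not that module:

import sympy as sp
from sympy.parsing.sympy_parser import parse_expr

# Function names the grammar allows (lowercase abs maps onto SymPy's Abs; sqrt lowers to Pow).
ALLOWED_FUNCS = {"sin", "cos", "tan", "exp", "log", "sqrt", "Abs"}

def parse_rhs(rhs: str, declared: set[str]) -> sp.Expr:
    """Parse the right-hand side of an ODE, rejecting anything outside the grammar."""
    local = {name: sp.Symbol(name) for name in declared}
    local["abs"] = sp.Abs
    expr = parse_expr(rhs, local_dict=local)
    # Undeclared symbols (typos, stray variables) fail the format check.
    if not expr.free_symbols <= {sp.Symbol(n) for n in declared}:
        raise ValueError(f"undeclared symbols in: {rhs}")
    # Any applied function outside the whitelist (sinh, custom names, ...) is rejected too.
    for f in expr.atoms(sp.Function):
        if type(f).__name__ not in ALLOWED_FUNCS:
            raise ValueError(f"disallowed function: {type(f).__name__}")
    return expr

# e.g. parse_rhs("-(k/m) * x - (b/m) * dx", {"x", "dx", "k", "m", "b"})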

Episode Flow

sequenceDiagram
    participant Agent
    participant Env as PhysiXEnvironment
    participant Sim as scipy.odeint
    participant Verifier

    Env->>Agent: reset() → trajectory + hint
    loop up to 8 turns
        Agent->>Env: step(equation + params + rationale)
        Env->>Sim: simulate hypothesis from t=0
        Sim-->>Verifier: predicted trajectory
        Verifier-->>Env: r_match + r_progress + r_simplicity + r_format
        Env->>Agent: mismatch summary + reward breakdown + history
        alt r_match > 0.93 or budget exhausted
            Env-->>Agent: done=True
        end
    end

After each step the agent receives an English mismatch summary (e.g. "predicted vy diverges after t=2 s; residual consistently negative") alongside the numeric reward breakdown, so it has something to act on in the next turn.
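
A toy version of how such a summary can be produced from the residuals. The real logic lives in physix/verifier/mismatch.py; the thresholds and wording below are illustrative assumptions:

import numpy as np

def mismatch_summary(t: np.ndarray, observed: np.ndarray, predicted: np.ndarray,
                     var: str = "x") -> str:
    """Turn per-step residuals into a one-line English hint for the next turn."""
    resid = predicted - observed
    span = observed.max() - observed.min()
    parts = []
    # Where does the prediction first drift noticeably away from the data?
    drifting = np.abs(resid) > 0.25 * span
    if drifting.any():
        parts.append(f"predicted {var} diverges after t={t[drifting][0]:.1f} s")
    # A residual with a consistent sign points at a biased model (e.g. missing damping).
    if abs(np.sign(resid).mean()) > 0.8:
        parts.append(f"residual consistently {'positive' if resid.mean() > 0 else 'negative'}")
    return "; ".join(parts) if parts else f"{var} matches within tolerance"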


Reward

All reward is computed from scipy.odeint output — no model-in-the-loop scoring.

Step-wise (live env + GRPO)

| Component | Weight | Formula | Purpose |
|---|---|---|---|
| match | 0.50 | R² (observed vs. predicted) | primary accuracy signal |
| progress | 0.20 | max(0, r_match − r_match_prev) | per-turn improvement shaping |
| simplicity | 0.20 | 1 − (operator_count / 12) | prefer shorter equations |
| format | 0.10 | 1 if parsed and simulated successfully | syntactic + numerical validity |
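
Put together, the step reward is the weighted sum of the table above. A minimal sketch, assuming R² is clipped to [0, 1] before weighting and leaving out the anti-hacking gates described below (the shipped composition lives in physix/verifier/reward.py):

def step_reward(r2: float, r2_prev: float, operator_count: int, simulated_ok: bool) -> float:
    """Weighted sum of the four step-wise components."""
    r_match = min(max(r2, 0.0), 1.0)                     # clipped R² (assumption)
    r_progress = max(0.0, r_match - max(r2_prev, 0.0))   # per-turn improvement
    r_simplicity = max(0.0, 1.0 - operator_count / 12)   # shorter equations score higher
    r_format = 1.0 if simulated_ok else 0.0              # parsed + integrated successfully
    return 0.50 * r_match + 0.20 * r_progress + 0.20 * r_simplicity + 0.10 * r_format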

GRPO-only additions

Two extra signals are added during training but not used in the live env:

  • match_dense = sqrt(R²) — gives a non-trivial gradient when raw R² is near zero (e.g. sqrt(0.05) ≈ 0.22).
  • correctness = 1 if R² ≥ 0.70, else 0 — a binary bonus that helps push past R² plateaus where the dense signal flattens.

Reward-hacking mitigations

Three failure modes found during development and how they were closed:

1. Parse-but-crash exploit. A valid-but-explosive equation (e.g. d2y/dt2 = exp(vy**10)) parses but makes odeint produce NaN; without a fix it still earns format = 1.
   Fix: format = 1 only if integration completes without NaN/inf.

2. Trivial-equation exploit. d2y/dt2 = 0 has zero operators, so simplicity = 1, earning the full 20% simplicity weight for a completely wrong trajectory.
   Fix: simplicity = 0 unless r_match ≥ 0.10.

3. Progress signal in single-turn GRPO. Every GRPO training row starts with previous_r_match = 0, so progress = r_match — a redundant copy of the match signal that dilutes advantage estimates.
   Fix: progress is excluded from the GRPO reward function set; it is only used in multi-turn live episodes.
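
In code terms, the first two fixes are simple gates and the third is a choice of which callables the GRPO trainer sees. A sketch under the same assumptions as above; the shipped versions live in physix/verifier/reward.py and physix/training/reward_fns.py:

import math
import numpy as np

def gated_components(r2: float, operator_count: int, parsed: bool, pred: np.ndarray) -> dict:
    """Reward components with the anti-hacking gates applied."""
    r_match = min(max(r2, 0.0), 1.0)
    # 1. format = 1 only if parsing *and* integration produced finite values.
    r_format = 1.0 if parsed and np.isfinite(pred).all() else 0.0
    # 2. simplicity is only paid out once the trajectory is at least roughly right.
    r_simplicity = max(0.0, 1.0 - operator_count / 12) if r_match >= 0.10 else 0.0
    return {"match": r_match, "simplicity": r_simplicity, "format": r_format}

# 3. GRPO reward set: progress is dropped; two extra signals are added instead.
def match_dense(r2: float) -> float:
    return math.sqrt(max(r2, 0.0))        # non-trivial gradient when R² is near zero

def correctness(r2: float) -> float:
    return 1.0 if r2 >= 0.70 else 0.0     # binary bonus past the R² plateau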


Training: SFT → GRPO

Why SFT first

GRPO relies on reward variance across rollouts to estimate advantages. With a cold base model, ~80% of completions are unparseable (LaTeX, prose, malformed JSON) and most parseable ones crash the integrator, leaving near-zero variance and no useful gradient. The model needs to produce the right output format before RL can do anything meaningful with the physics signal.

SFT runs for 3 epochs on synthetic (prompt, ground_truth_equation) pairs generated from the environment. After SFT:

  • 90% of completions parse and simulate successfully (up from ~20%).
  • Equations are in the ASCII ODE grammar the verifier expects.
  • The model has seen the right equation family for each system at least once.

SFT only establishes format. Parameter values are still wrong — that is what GRPO refines.
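
Concretely, a synthetic pair is just the rendered observation as the prompt and the ground-truth answer as the completion. The field names below are an assumption about the dataset builder; the text is abbreviated from the example task above:

sft_pair = {
    # Prompt: the rendered observation (hint + trajectory + stats), abbreviated here.
    "prompt": "HINT: Mass on a spring, displaced 0.40 m and released. ...\n"
              "TRAJECTORY (t, x, dx):\n  t=0.00  x= 0.40  dx= 0.00\n  ...",
    # Completion: the ground-truth equation family with the sampled parameters.
    "completion": (
        '{"equation": "d2x/dt2 = -(k/m) * x - (b/m) * dx", '
        '"params": {"k": 4.0, "m": 1.0, "b": 0.36}, '
        '"rationale": "Linear restoring force plus velocity-proportional damping."}'
    ),
}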

Step 1 — SFT warm-start

python -m physix.training.sft \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-sft \
  --epochs 3 \
  --lora-r 32 \
  --instances-per-system 32 \
  --system-ids damped_spring

Runtime: ~5 min on L40S.

Step 2 — GRPO

python -m physix.training.loop \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-rl \
  --num-steps 200 \
  --num-generations 4 \
  --lora-r 32 \
  --sft-checkpoint runs/physix-3b-sft/merged \
  --system-ids damped_spring \
  --push-to-hub \
  --hub-repo-id Pratyush-01/physix-3b-rl

Runtime: ~45 min on L40S.

Full cloud job

hf jobs uv run train/job_train_single.py \
    --image unsloth/unsloth:2026.3.8-pt2.9.0-vllm-0.16.0-cu12.8-studio-release \
    --flavor l40sx1 \
    --secrets HF_TOKEN \
    --secrets WANDB_API_KEY \
    -v hf://datasets/Pratyush-01/physix-live-src:/physix-live \
    --timeout 2h

Training Results

Plots: GRPO loss (↓) and total reward (↑) over the run; the PNGs are committed under docs/plots/.

W&B runs: pratyush01/physix-live

Key observations from the run:

  • Total mean reward rises from ~3.3 to ~4.8 (+45%) over 200 steps, with the ±1σ band shrinking — the policy is both improving and becoming more consistent.
  • The SFT warm-start gets format compliance high from step 1, so GRPO spends its budget improving R² rather than relearning JSON syntax.

Repository Layout

physix-live/
├── physix/
│   ├── models.py                 # Pydantic Action / Observation / State
│   ├── client.py                 # OpenEnv WebSocket client
│   ├── systems/                  # physical systems (3 trained, exposed via SUPPORTED_SYSTEMS)
│   │   ├── base.py               # PhysicalSystem ABC
│   │   ├── tier1.py              # FreeFall, SimplePendulum (+ extras for future work)
│   │   ├── tier2.py              # DampedSpring (+ extras for future work)
│   │   ├── tier3.py              # placeholders for future extensions, not exposed
│   │   └── registry.py
│   ├── verifier/
│   │   ├── parser.py             # SymPy whitelisted grammar
│   │   ├── simulator.py          # scipy.odeint forward simulation
│   │   ├── metrics.py            # per-step R²
│   │   ├── mismatch.py           # English residual summary
│   │   └── reward.py             # reward composition + hacking mitigations
│   ├── server/
│   │   ├── environment.py        # PhysiXEnvironment (OpenEnv subclass)
│   │   ├── interactive.py        # session-based REST router for the UI
│   │   └── app.py
│   └── training/
│       ├── prompt.py             # observation → prompt
│       ├── scorer.py             # cached single-completion scorer
│       ├── reward_fns.py         # TRL-compatible reward callables
│       ├── dataset.py            # GRPO dataset builder
│       ├── sft.py                # SFT warm-start
│       └── loop.py               # Unsloth + TRL GRPO loop
├── frontend/                     # React + TS + Tailwind demo UI
├── train/                        # HF Jobs launcher + Colab notebook
│   ├── submit.py                 # submit job via HfApi.run_uv_job
│   ├── job_train.py              # multi-system driver (in-container)
│   ├── job_train_single.py       # single-system driver (in-container)
│   ├── physix_train_colab.ipynb  # SFT → GRPO end-to-end notebook
│   └── sync-plots.sh             # mirror plots from model repo
├── tests/                        # ~30 tests
├── docs/
│   ├── plots/                    # committed loss / reward / per-component PNGs
│   └── writeup.md
├── Dockerfile                    # env Space build (FastAPI + built React UI)
├── openenv.yaml                  # OpenEnv manifest (name, runtime, app entrypoint)
└── pyproject.toml

Quick Start

One command from the repo root:

make dev

This starts the FastAPI backend on :8000 (deps auto-resolved by uv) and the Vite frontend on :5173. Open http://localhost:5173.

Connecting an LLM

The demo speaks to any OpenAI-compatible /v1/chat/completions endpoint — local Ollama, Hugging Face Inference Providers, OpenAI, vLLM, OpenRouter, etc. The "Connect an LLM" panel exposes:

| Field | Purpose |
|---|---|
| Endpoint | Preset dropdown. Picks base_url + a default model id. |
| Model | Provider-native id (HF repo, Ollama tag, OpenAI name). Free-form. |
| Custom base URL | Shown when Custom is selected. Anything ending in /v1. |
| API key | Bearer token. Persisted per base_url in localStorage, never sent unless an episode runs. |

Server-side env-var fallback (lets a deployed Space ship a sensible default without leaking secrets in the bundle):

| URL family | Env var |
|---|---|
| *huggingface* | HF_TOKEN, then HUGGINGFACE_API_KEY |
| *openai.com* | OPENAI_API_KEY |
| *openrouter* | OPENROUTER_API_KEY |
| localhost / 127.0.0.1 | none (Ollama needs no key) |
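
A minimal sketch of that fallback (the real lookup runs server-side; the function name and ordering here are assumptions):

import os

def fallback_api_key(base_url: str) -> str | None:
    """Pick a server-side key for a known provider; never expose it to the frontend bundle."""
    url = base_url.lower()
    if "huggingface" in url:
        return os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")
    if "openai.com" in url:
        return os.getenv("OPENAI_API_KEY")
    if "openrouter" in url:
        return os.getenv("OPENROUTER_API_KEY")
    return None  # localhost / 127.0.0.1: Ollama needs no key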

Side-by-side comparison

The default page is a two-column comparison: same trajectory, same hint, same seed, same verifier — two different models. The presets are wired to make the headline story self-evident:

  • A = Pratyush-01/physix-3b-rl via HF Inference Providers (the GRPO-trained model)
  • B = Qwen/Qwen2.5-3B-Instruct via HF Inference Providers (untrained baseline)

Drop in gpt-4o-mini on either side as a frontier reference, or swap to local Ollama for offline dev. The reward delta between the two columns is exactly what GRPO bought — no benchmark prose needed.

For the trained model to appear on HF Inference Providers: the weights are public, but the model card needs inference: true and a serving provider (Featherless, Together, etc.) to have it loaded. If a visitor sees a 404 on the trained side, they can either run ollama serve locally and pull a quantised version, or fall back to Qwen/Qwen2.5-3B-Instruct on both sides.

Programmatic use

import asyncio
from physix import PhysiXEnv, PhysiXAction

async def main():
    async with PhysiXEnv(base_url="http://127.0.0.1:8000") as env:
        obs = await env.reset(system_id="damped_spring", seed=42)
        result = await env.step(
            PhysiXAction(
                equation="d2x/dt2 = -(k/m) * x - (b/m) * dx",
                params={"k": 4.0, "m": 1.0, "b": 0.36},
            )
        )
        print(result.observation.reward_breakdown)

asyncio.run(main())
Run the test suite:

pytest tests/

License

MIT.