---
title: PhysiX
emoji: ⚛️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Equation discovery from noisy trajectories (RLVR)
tags:
  - openenv
  - rlvr
  - physics
  - equation-discovery
  - ode
---

PhysiX — Equation Discovery via RLVR

An OpenEnv hackathon submission (Apr 2026).

Given a noisy trajectory and a one-sentence hint, a language model iteratively proposes and refines an ODE that reproduces the observed motion. Reward comes entirely from scipy.integrate.odeint + per-step R² — no LLM-as-judge.


Links


Environment

Physical Systems

Three physical systems, each with randomised parameters and initial conditions per episode. These are the systems the published model is trained and evaluated on.

| System | Ground-truth equation | Notes |
|---|---|---|
| Free Fall | `d2y/dt2 = -g` | 1 parameter |
| Simple Pendulum | `d2theta/dt2 = -(g/L)*sin(theta)` | transcendental |
| Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` | damped oscillation |

The verifier and DSL are designed to extend cleanly to richer dynamics (coupled oscillators, Lorenz, reaction–diffusion, N-body). Adding a system is a matter of writing a PhysicalSystem subclass and registering it — no changes to the reward, parser, or training loop. See docs/writeup.md → Future Scope for the extensibility plan.
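
A rough sketch of what that extension looks like in code. The module paths mirror the repository layout (physix/systems/base.py, physix/systems/registry.py), but the attribute names, method signatures, and the register_system helper are illustrative assumptions rather than the published API:

# Hypothetical fourth system; everything below is a sketch, not the shipped interface.
from physix.systems.base import PhysicalSystem        # ABC from the repo layout
from physix.systems.registry import register_system   # assumed registry helper

class CoupledOscillators(PhysicalSystem):
    """Two masses joined by a coupling spring, one of the 'future scope' systems."""
    system_id = "coupled_oscillators"
    state_vars = ["x1", "dx1", "x2", "dx2"]
    param_names = ["k", "kc", "m"]

    def ground_truth(self) -> list[str]:
        # Written in the same ASCII ODE grammar the verifier already parses.
        return [
            "d2x1/dt2 = -(k/m)*x1 + (kc/m)*(x2 - x1)",
            "d2x2/dt2 = -(k/m)*x2 - (kc/m)*(x2 - x1)",
        ]

    def sample_params(self, rng) -> dict[str, float]:
        # Randomised per episode, like the three trained systems.
        return {"k": rng.uniform(2.0, 6.0), "kc": rng.uniform(0.5, 2.0), "m": 1.0}

register_system(CoupledOscillators)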

Example Task

HINT: Mass on a spring, displaced 0.40 m and released.
      Visible amplitude decay over a few seconds.

TRAJECTORY (t, x, dx):
  t=0.00  x= 0.40  dx= 0.00
  t=0.50  x= 0.27  dx=-0.69
  t=1.00  x=-0.06  dx=-0.84
  t=1.50  x=-0.30  dx=-0.32
  t=2.00  x=-0.30  dx= 0.34
  t=3.00  x= 0.10  dx= 0.41
  t=5.00  x=-0.04  dx=-0.10   ← amplitude visibly shrinking

STATS: x_range=[-0.31, 0.40]  decay_envelope≈e^(-0.18 t)

Target output:

{
  "equation": "d2x/dt2 = -(k/m) * x - (b/m) * dx",
  "params": {"k": 4.0, "m": 1.0, "b": 0.36},
  "rationale": "Linear restoring force plus velocity-proportional damping; envelope decay matches b/(2m)."
}

Grammar: operators + - * / **, functions sin cos tan exp log sqrt abs, declared state variables and parameter names. Anything outside this scores format = 0.
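
A minimal sketch of how a whitelist like this can be enforced with SymPy. The real parser lives in physix/verifier/parser.py; the function below is an illustration with assumed names, not that module:

import sympy as sp
from sympy.parsing.sympy_parser import parse_expr

# Function names the grammar allows (lowercase abs maps onto SymPy's Abs; sqrt lowers to Pow).
ALLOWED_FUNCS = {"sin", "cos", "tan", "exp", "log", "sqrt", "Abs"}

def parse_rhs(rhs: str, declared: set[str]) -> sp.Expr:
    """Parse the right-hand side of an ODE, rejecting anything outside the grammar."""
    local = {name: sp.Symbol(name) for name in declared}
    local["abs"] = sp.Abs
    expr = parse_expr(rhs, local_dict=local)
    # Undeclared symbols (typos, stray variables) fail the format check.
    if not expr.free_symbols <= {sp.Symbol(n) for n in declared}:
        raise ValueError(f"undeclared symbols in: {rhs}")
    # Any applied function outside the whitelist (sinh, custom names, ...) is rejected too.
    for f in expr.atoms(sp.Function):
        if type(f).__name__ not in ALLOWED_FUNCS:
            raise ValueError(f"disallowed function: {type(f).__name__}")
    return expr

# e.g. parse_rhs("-(k/m) * x - (b/m) * dx", {"x", "dx", "k", "m", "b"})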

Episode Flow

sequenceDiagram
    participant Agent
    participant Env as PhysiXEnvironment
    participant Sim as scipy.odeint
    participant Verifier

    Env->>Agent: reset() → trajectory + hint
    loop up to 8 turns
        Agent->>Env: step(equation + params + rationale)
        Env->>Sim: simulate hypothesis from t=0
        Sim-->>Verifier: predicted trajectory
        Verifier-->>Env: r_match + r_progress + r_simplicity + r_format
        Env->>Agent: mismatch summary + reward breakdown + history
        alt r_match > 0.93 or budget exhausted
            Env-->>Agent: done=True
        end
    end

After each step the agent receives an English mismatch summary (e.g. "predicted vy diverges after t=2 s; residual consistently negative") alongside the numeric reward breakdown, so it has something to act on in the next turn.
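
A toy version of how such a summary can be produced from the residuals. The real logic lives in physix/verifier/mismatch.py; the thresholds and wording below are illustrative assumptions:

import numpy as np

def mismatch_summary(t: np.ndarray, observed: np.ndarray, predicted: np.ndarray,
                     var: str = "x") -> str:
    """Turn per-step residuals into a one-line English hint for the next turn."""
    resid = predicted - observed
    span = observed.max() - observed.min()
    parts = []
    # Where does the prediction first drift noticeably away from the data?
    drifting = np.abs(resid) > 0.25 * span
    if drifting.any():
        parts.append(f"predicted {var} diverges after t={t[drifting][0]:.1f} s")
    # A residual with a consistent sign points at a biased model (e.g. missing damping).
    if abs(np.sign(resid).mean()) > 0.8:
        parts.append(f"residual consistently {'positive' if resid.mean() > 0 else 'negative'}")
    return "; ".join(parts) if parts else f"{var} matches within tolerance"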


Reward

All reward is computed from scipy.odeint output — no model-in-the-loop scoring.

Step-wise (live env + GRPO)

| Component | Weight | Formula | Purpose |
|---|---|---|---|
| match | 0.50 | R² (observed vs. predicted) | primary accuracy signal |
| progress | 0.20 | max(0, r_match − r_match_prev) | per-turn improvement shaping |
| simplicity | 0.20 | 1 − (operator_count / 12) | prefer shorter equations |
| format | 0.10 | 1 if parsed and simulated successfully | syntactic + numerical validity |
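
Put together, the step reward is the weighted sum of the table above. A minimal sketch, assuming R² is clipped to [0, 1] before weighting and leaving out the anti-hacking gates described below (the shipped composition lives in physix/verifier/reward.py):

def step_reward(r2: float, r2_prev: float, operator_count: int, simulated_ok: bool) -> float:
    """Weighted sum of the four step-wise components."""
    r_match = min(max(r2, 0.0), 1.0)                     # clipped R² (assumption)
    r_progress = max(0.0, r_match - max(r2_prev, 0.0))   # per-turn improvement
    r_simplicity = max(0.0, 1.0 - operator_count / 12)   # shorter equations score higher
    r_format = 1.0 if simulated_ok else 0.0              # parsed + integrated successfully
    return 0.50 * r_match + 0.20 * r_progress + 0.20 * r_simplicity + 0.10 * r_format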

GRPO-only additions

Two extra signals are added during training but not used in the live env:

  • match_dense = sqrt(R²) — gives a non-trivial gradient when raw R² is near zero (e.g. sqrt(0.05) ≈ 0.22).
  • correctness = 1 if R² ≥ 0.70, else 0 — a binary bonus that helps push past R² plateaus where the dense signal flattens.

Reward-hacking mitigations

Three failure modes found during development and how they were closed:

1. Parse-but-crash exploit. A valid-but-explosive equation (e.g. d2y/dt2 = exp(vy**10)) parses but makes odeint produce NaN; without a fix it still earns format = 1.
   Fix: format = 1 only if integration completes without NaN/inf.

2. Trivial-equation exploit. d2y/dt2 = 0 has zero operators, so simplicity = 1, earning the full 20% simplicity weight for a completely wrong trajectory.
   Fix: simplicity = 0 unless r_match ≥ 0.10.

3. Progress signal in single-turn GRPO. Every GRPO training row starts with previous_r_match = 0, so progress = r_match — a redundant copy of the match signal that dilutes advantage estimates.
   Fix: progress is excluded from the GRPO reward function set; it is only used in multi-turn live episodes.
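
In code terms, the first two fixes are simple gates and the third is a choice of which callables the GRPO trainer sees. A sketch under the same assumptions as above; the shipped versions live in physix/verifier/reward.py and physix/training/reward_fns.py:

import math
import numpy as np

def gated_components(r2: float, operator_count: int, parsed: bool, pred: np.ndarray) -> dict:
    """Reward components with the anti-hacking gates applied."""
    r_match = min(max(r2, 0.0), 1.0)
    # 1. format = 1 only if parsing *and* integration produced finite values.
    r_format = 1.0 if parsed and np.isfinite(pred).all() else 0.0
    # 2. simplicity is only paid out once the trajectory is at least roughly right.
    r_simplicity = max(0.0, 1.0 - operator_count / 12) if r_match >= 0.10 else 0.0
    return {"match": r_match, "simplicity": r_simplicity, "format": r_format}

# 3. GRPO reward set: progress is dropped; two extra signals are added instead.
def match_dense(r2: float) -> float:
    return math.sqrt(max(r2, 0.0))        # non-trivial gradient when R² is near zero

def correctness(r2: float) -> float:
    return 1.0 if r2 >= 0.70 else 0.0     # binary bonus past the R² plateau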


Training: SFT → GRPO

Why SFT first

GRPO relies on reward variance across rollouts to estimate advantages. With a cold base model, ~80% of completions are unparseable (LaTeX, prose, malformed JSON) and most parseable ones crash the integrator, leaving near-zero variance and no useful gradient. The model needs to produce the right output format before RL can do anything meaningful with the physics signal.

SFT runs for 3 epochs on synthetic (prompt, ground_truth_equation) pairs generated from the environment. After SFT:

  • 90% of completions parse and simulate successfully (up from ~20%).
  • Equations are in the ASCII ODE grammar the verifier expects.
  • The model has seen the right equation family for each system at least once.

SFT only establishes format. Parameter values are still wrong — that is what GRPO refines.
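
Concretely, a synthetic pair is just the rendered observation as the prompt and the ground-truth answer as the completion. The field names below are an assumption about the dataset builder; the text is abbreviated from the example task above:

sft_pair = {
    # Prompt: the rendered observation (hint + trajectory + stats), abbreviated here.
    "prompt": "HINT: Mass on a spring, displaced 0.40 m and released. ...\n"
              "TRAJECTORY (t, x, dx):\n  t=0.00  x= 0.40  dx= 0.00\n  ...",
    # Completion: the ground-truth equation family with the sampled parameters.
    "completion": (
        '{"equation": "d2x/dt2 = -(k/m) * x - (b/m) * dx", '
        '"params": {"k": 4.0, "m": 1.0, "b": 0.36}, '
        '"rationale": "Linear restoring force plus velocity-proportional damping."}'
    ),
}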

Step 1 — SFT warm-start

python -m physix.training.sft \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-sft \
  --epochs 3 \
  --lora-r 32 \
  --instances-per-system 32 \
  --system-ids damped_spring

Runtime: ~5 min on L40S.

Step 2 — GRPO

python -m physix.training.loop \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-rl \
  --num-steps 200 \
  --num-generations 4 \
  --lora-r 32 \
  --sft-checkpoint runs/physix-3b-sft/merged \
  --system-ids damped_spring \
  --push-to-hub \
  --hub-repo-id Pratyush-01/physix-3b-rl

Runtime: ~45 min on L40S.

Full cloud job

hf jobs uv run train/job_train_single.py \
    --image unsloth/unsloth:2026.3.8-pt2.9.0-vllm-0.16.0-cu12.8-studio-release \
    --flavor l40sx1 \
    --secrets HF_TOKEN \
    --secrets WANDB_API_KEY \
    -v hf://datasets/Pratyush-01/physix-live-src:/physix-live \
    --timeout 2h

Training Results

Plots: GRPO loss (↓) and total reward (↑) over the run; the PNGs are committed under docs/plots/.

W&B runs: pratyush01/physix-live

Key observations from the run:

  • Total mean reward rises from ~3.3 to ~4.8 (+45%) over 200 steps, with the ±1σ band shrinking — the policy is both improving and becoming more consistent.
  • The SFT warm-start gets format compliance high from step 1, so GRPO spends its budget improving R² rather than relearning JSON syntax.

Repository Layout

physix-live/
├── physix/
│   ├── models.py                 # Pydantic Action / Observation / State
│   ├── client.py                 # OpenEnv WebSocket client
│   ├── systems/                  # physical systems (3 trained, exposed via SUPPORTED_SYSTEMS)
│   │   ├── base.py               # PhysicalSystem ABC
│   │   ├── tier1.py              # FreeFall, SimplePendulum (+ extras for future work)
│   │   ├── tier2.py              # DampedSpring (+ extras for future work)
│   │   ├── tier3.py              # placeholders for future extensions, not exposed
│   │   └── registry.py
│   ├── verifier/
│   │   ├── parser.py             # SymPy whitelisted grammar
│   │   ├── simulator.py          # scipy.odeint forward simulation
│   │   ├── metrics.py            # per-step R²
│   │   ├── mismatch.py           # English residual summary
│   │   └── reward.py             # reward composition + hacking mitigations
│   ├── server/
│   │   ├── environment.py        # PhysiXEnvironment (OpenEnv subclass)
│   │   ├── interactive.py        # session-based REST router for the UI
│   │   └── app.py
│   └── training/
│       ├── prompt.py             # observation → prompt
│       ├── scorer.py             # cached single-completion scorer
│       ├── reward_fns.py         # TRL-compatible reward callables
│       ├── dataset.py            # GRPO dataset builder
│       ├── sft.py                # SFT warm-start
│       └── loop.py               # Unsloth + TRL GRPO loop
├── frontend/                     # React + TS + Tailwind demo UI
├── train/                        # HF Jobs launcher + Colab notebook
│   ├── submit.py                 # submit job via HfApi.run_uv_job
│   ├── job_train.py              # multi-system driver (in-container)
│   ├── job_train_single.py       # single-system driver (in-container)
│   ├── physix_train_colab.ipynb  # SFT → GRPO end-to-end notebook
│   └── sync-plots.sh             # mirror plots from model repo
├── tests/                        # ~30 tests
├── docs/
│   ├── plots/                    # committed loss / reward / per-component PNGs
│   └── writeup.md
├── Dockerfile                    # env Space build (FastAPI + built React UI)
├── openenv.yaml                  # OpenEnv manifest (name, runtime, app entrypoint)
└── pyproject.toml

Quick Start

One command from the repo root:

make dev

This starts the FastAPI backend on :8000 (deps auto-resolved by uv) and the Vite frontend on :5173. Open http://localhost:5173.

Connecting an LLM

The demo speaks to any OpenAI-compatible /v1/chat/completions endpoint — local Ollama, Hugging Face Inference Providers, OpenAI, vLLM, OpenRouter, etc. The "Connect an LLM" panel exposes:

| Field | Purpose |
|---|---|
| Endpoint | Preset dropdown. Picks base_url + a default model id. |
| Model | Provider-native id (HF repo, Ollama tag, OpenAI name). Free-form. |
| Custom base URL | Shown when Custom is selected. Anything ending in /v1. |
| API key | Bearer token. Persisted per base_url in localStorage, never sent unless an episode runs. |

Server-side env-var fallback (lets a deployed Space ship a sensible default without leaking secrets in the bundle):

| URL family | Env var |
|---|---|
| *huggingface* | HF_TOKEN, then HUGGINGFACE_API_KEY |
| *openai.com* | OPENAI_API_KEY |
| *openrouter* | OPENROUTER_API_KEY |
| localhost / 127.0.0.1 | none (Ollama needs no key) |
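
A minimal sketch of that fallback (the real lookup runs server-side; the function name and ordering here are assumptions):

import os

def fallback_api_key(base_url: str) -> str | None:
    """Pick a server-side key for a known provider; never expose it to the frontend bundle."""
    url = base_url.lower()
    if "huggingface" in url:
        return os.getenv("HF_TOKEN") or os.getenv("HUGGINGFACE_API_KEY")
    if "openai.com" in url:
        return os.getenv("OPENAI_API_KEY")
    if "openrouter" in url:
        return os.getenv("OPENROUTER_API_KEY")
    return None  # localhost / 127.0.0.1: Ollama needs no key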

Side-by-side comparison

The default page is a two-column comparison: same trajectory, same hint, same seed, same verifier — two different models. The presets are wired to make the headline story self-evident:

  • A = Pratyush-01/physix-3b-rl via HF Inference Providers (the GRPO-trained model)
  • B = Qwen/Qwen2.5-3B-Instruct via HF Inference Providers (untrained baseline)

Drop in gpt-4o-mini on either side as a frontier reference, or swap to local Ollama for offline dev. The reward delta between the two columns is exactly what GRPO bought — no benchmark prose needed.

For the trained model to appear on HF Inference Providers: the weights are public, but the model card needs inference: true and a serving provider (Featherless, Together, etc.) to have it loaded. If a visitor sees a 404 on the trained side, they can either run ollama serve locally and pull a quantised version, or fall back to Qwen/Qwen2.5-3B-Instruct on both sides.

Programmatic use

import asyncio
from physix import PhysiXEnv, PhysiXAction

async def main():
    async with PhysiXEnv(base_url="http://127.0.0.1:8000") as env:
        obs = await env.reset(system_id="damped_spring", seed=42)
        result = await env.step(
            PhysiXAction(
                equation="d2x/dt2 = -(k/m) * x - (b/m) * dx",
                params={"k": 4.0, "m": 1.0, "b": 0.36},
            )
        )
        print(result.observation.reward_breakdown)

asyncio.run(main())
Run the test suite:

pytest tests/

License

MIT.