---
title: PhysiX
emoji: ⚛️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Equation discovery from noisy trajectories (RLVR)
tags:
  - openenv
  - rlvr
  - physics
  - equation-discovery
  - ode
---

# PhysiX — Equation Discovery via RLVR

An [OpenEnv](https://github.com/openenv-hackathon/openenv) hackathon submission (Apr 2026). Given a noisy trajectory and a one-sentence hint, a language model iteratively proposes and refines an ODE that reproduces the observed motion. Reward comes entirely from `scipy.integrate.odeint` + per-step R² — no LLM-as-judge.

---

## Links

| | |
|---|---|
| **Live demo (HF Space)** | https://huggingface.co/spaces/Pratyush-01/physix-live |
| **Trained model** | https://huggingface.co/Pratyush-01/physix-3b-rl |
| **Colab training notebook** | https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/train/physix_train_colab.ipynb |
| **W&B training runs** | https://wandb.ai/pratyush01/physix-live |
| **Blog post / writeup** | https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/docs/blog.md |
| **Checkpoint repo** | https://huggingface.co/Pratyush-01/physix-3b-rl-ckpt |

---

## Environment

### Physical Systems

Three physical systems, each with randomised parameters and initial conditions per episode. These are the systems the published model is trained and evaluated on.

| System | Ground-truth equation | Notes |
|--------|-----------------------|-------|
| Free Fall | `d2y/dt2 = -g` | 1 parameter |
| Simple Pendulum | `d2theta/dt2 = -(g/L)*sin(theta)` | transcendental |
| Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` | damped oscillation |

The verifier and DSL are designed to extend cleanly to richer dynamics (coupled oscillators, Lorenz, reaction–diffusion, N-body). Adding a system is a matter of writing a `PhysicalSystem` subclass and registering it — no changes to the reward, parser, or training loop. See [docs/blog.md → Future Scope](docs/blog.md) for the extensibility plan.

### Example Task

```
HINT: Mass on a spring, displaced 0.40 m and released. Visible amplitude decay over a few seconds.

TRAJECTORY (t, x, dx):
t=0.00  x= 0.40  dx= 0.00
t=0.50  x= 0.27  dx=-0.69
t=1.00  x=-0.06  dx=-0.84
t=1.50  x=-0.30  dx=-0.32
t=2.00  x=-0.30  dx= 0.34
t=3.00  x= 0.10  dx= 0.41
t=5.00  x=-0.04  dx=-0.10   ← amplitude visibly shrinking

STATS: x_range=[-0.31, 0.40]  decay_envelope≈e^(-0.18 t)
```

Target output:

```json
{
  "equation": "d2x/dt2 = -(k/m) * x - (b/m) * dx",
  "params": {"k": 4.0, "m": 1.0, "b": 0.36},
  "rationale": "Linear restoring force plus velocity-proportional damping; envelope decay matches b/(2m)."
}
```

**Grammar:** operators `+ - * / **`, functions `sin cos tan exp log sqrt abs`, declared state variables and parameter names. Anything outside this scores `format = 0`.

### Episode Flow

```mermaid
sequenceDiagram
    participant Agent
    participant Env as PhysiXEnvironment
    participant Sim as scipy.odeint
    participant Verifier
    Env->>Agent: reset() → trajectory + hint
    loop up to 8 turns
        Agent->>Env: step(equation + params + rationale)
        Env->>Sim: simulate hypothesis from t=0
        Sim-->>Verifier: predicted trajectory
        Verifier-->>Env: r_match + r_progress + r_simplicity + r_format
        Env->>Agent: mismatch summary + reward breakdown + history
        alt r_match > 0.93 or budget exhausted
            Env-->>Agent: done=True
        end
    end
```

After each step the agent receives an English mismatch summary (e.g. *"predicted vy diverges after t=2 s; residual consistently negative"*) alongside the numeric reward breakdown, so it has something to act on in the next turn.

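The verification path behind that loop is deliberately small: parse the ASCII equation against the whitelist, forward-simulate it with `scipy.integrate.odeint`, and score the fit against the observed trajectory. A minimal sketch of that path for a damped-spring-style state `(x, dx)`; the function name, signature, and the single aggregate R² are illustrative rather than the exact API in `physix/verifier/`:

```python
# Illustrative sketch only; the real implementation lives in physix/verifier/
# (parser.py, simulator.py, metrics.py) and differs in detail.
import numpy as np
import sympy as sp
from scipy.integrate import odeint

ALLOWED_FUNCS = {"sin": sp.sin, "cos": sp.cos, "tan": sp.tan,
                 "exp": sp.exp, "log": sp.log, "sqrt": sp.sqrt, "abs": sp.Abs}

def score_hypothesis(rhs_text, params, t_obs, x_obs, dx_obs):
    """Parse `d2x/dt2 = <rhs_text>`, simulate it, and return an R²-style fit score."""
    # 1. Whitelisted grammar: only declared state variables, parameters and functions.
    x, dx = sp.symbols("x dx")
    names = {"x": x, "dx": dx, **{k: sp.Symbol(k) for k in params}}
    rhs = sp.sympify(rhs_text, locals={**ALLOWED_FUNCS, **names})
    if not rhs.free_symbols <= set(names.values()):
        return 0.0  # undeclared symbol -> format = 0
    accel = sp.lambdify((x, dx), rhs.subs({names[k]: v for k, v in params.items()}), "numpy")

    # 2. Forward-simulate the hypothesis from the observed initial condition.
    def deriv(state, t):
        pos, vel = state
        return [vel, accel(pos, vel)]
    pred = odeint(deriv, [x_obs[0], dx_obs[0]], t_obs)
    if not np.all(np.isfinite(pred)):
        return 0.0  # NaN/inf during integration -> format = 0

    # 3. R² between observed and predicted (x, dx) trajectories.
    obs = np.column_stack([x_obs, dx_obs])
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean(axis=0)) ** 2)
    return max(0.0, 1.0 - ss_res / ss_tot)
```

For the damped-spring example above, `score_hypothesis("-(k/m)*x - (b/m)*dx", {"k": 4.0, "m": 1.0, "b": 0.36}, t, x, dx)` would return (roughly) the `match` signal that the reward section below builds on.
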
---

## Reward

All reward is computed from `scipy.odeint` output — no model-in-the-loop scoring.

### Step-wise (live env + GRPO)

| Component | Weight | Formula | Purpose |
|-----------|:------:|---------|---------|
| `match` | 0.50 | R² (observed vs. predicted) | primary accuracy signal |
| `progress` | 0.20 | `max(0, r_match − r_match_prev)` | per-turn improvement shaping |
| `simplicity` | 0.20 | `1 − (operator_count / 12)` | prefer shorter equations |
| `format` | 0.10 | 1 if parsed **and** simulated successfully | syntactic + numerical validity |

### GRPO-only additions

Two extra signals are added during training but not used in the live env:

- **`match_dense = sqrt(R²)`** — gives a non-trivial gradient when raw R² is near zero (e.g. `sqrt(0.05) ≈ 0.22`).
- **`correctness` = 1 if R² ≥ 0.70, else 0** — a binary bonus that helps push past R² plateaus where the dense signal flattens.

### Reward-hacking mitigations

Three failure modes found during development, and how each was closed:

**1. Parse-but-crash exploit.** A valid-but-explosive equation (e.g. `d2y/dt2 = exp(vy**10)`) parses but makes `odeint` produce NaN. Without a fix, it still earns `format = 1`.
→ `format = 1` only if integration completes without NaN/inf.

**2. Trivial-equation exploit.** `d2y/dt2 = 0` has zero operators, so `simplicity = 1`, earning 20% of the reward for a completely wrong trajectory.
→ `simplicity = 0` unless `r_match ≥ 0.10`.

**3. Progress signal in single-turn GRPO.** Every GRPO training row starts with `previous_r_match = 0`, so `progress = r_match` — a redundant copy of the match signal that dilutes advantage estimates.
→ `progress` is excluded from the GRPO reward function set; it is only used in multi-turn live episodes.

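Putting the table and the mitigations together: the live-env step reward is a weighted sum of the four components, and the two GRPO-only signals are simple functions of R². A minimal sketch of that composition (illustrative only; the actual logic lives in `physix/verifier/reward.py` and `physix/training/reward_fns.py` and may differ in detail):

```python
# Illustrative composition only; see physix/verifier/reward.py for the real thing.
import math

def step_reward(r2, prev_r2, operator_count, simulated_ok):
    r_match = max(0.0, r2)
    r_progress = max(0.0, r_match - prev_r2)  # per-turn improvement shaping
    # Simplicity is gated on a minimal match so `d2x/dt2 = 0` earns nothing (mitigation 2).
    r_simplicity = 0.0 if r_match < 0.10 else max(0.0, 1 - operator_count / 12)
    # Format counts only if parsing AND integration succeeded without NaN/inf (mitigation 1).
    r_format = 1.0 if simulated_ok else 0.0
    total = 0.50 * r_match + 0.20 * r_progress + 0.20 * r_simplicity + 0.10 * r_format
    return {"match": r_match, "progress": r_progress,
            "simplicity": r_simplicity, "format": r_format, "total": total}

# GRPO-only signals; training rows are single-turn, so `progress` is dropped there (mitigation 3).
def match_dense(r2):
    return math.sqrt(max(0.0, r2))      # non-trivial gradient even when R² is tiny

def correctness(r2):
    return 1.0 if r2 >= 0.70 else 0.0   # binary bonus past the R² plateau
```
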
---

## Training: SFT → GRPO

### Why SFT first

GRPO relies on reward variance across rollouts to estimate advantages. With a cold base model, ~80% of completions are unparseable (LaTeX, prose, malformed JSON) and most parseable ones crash the integrator, leaving near-zero variance and no useful gradient. The model needs to produce the right output format before RL can do anything meaningful with the physics signal.

SFT runs for 3 epochs on synthetic `(prompt, ground_truth_equation)` pairs generated from the environment. After SFT:

- >90% of completions parse and simulate successfully (up from ~20%).
- Equations are in the ASCII ODE grammar the verifier expects.
- The model has seen the right equation family for each system at least once.

SFT only establishes format. Parameter values are still wrong — that is what GRPO refines.

### Step 1 — SFT warm-start

```bash
python -m physix.training.sft \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-sft \
  --epochs 3 \
  --lora-r 32 \
  --instances-per-system 32 \
  --system-ids damped_spring
```

Runtime: ~5 min on L40S.

### Step 2 — GRPO

```bash
python -m physix.training.loop \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-rl \
  --num-steps 200 \
  --num-generations 4 \
  --lora-r 32 \
  --sft-checkpoint runs/physix-3b-sft/merged \
  --system-ids damped_spring \
  --push-to-hub \
  --hub-repo-id Pratyush-01/physix-3b-rl
```

Runtime: ~45 min on L40S.

### Full cloud job

```bash
hf jobs uv run train/job_train_single.py \
  --image unsloth/unsloth:2026.3.8-pt2.9.0-vllm-0.16.0-cu12.8-studio-release \
  --flavor l40sx1 \
  --secrets HF_TOKEN \
  --secrets WANDB_API_KEY \
  -v hf://datasets/Pratyush-01/physix-live-src:/physix-live \
  --timeout 2h
```

---

## Training Results

| GRPO Loss (↓) | Total Reward (↑) |
|:---:|:---:|
| ![loss](docs/plots/loss.png) | ![reward](docs/plots/reward.png) |

W&B runs: [pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live)

Key observations from the run:

- Total mean reward rises from ~3.3 to ~4.8 (+45%) over 200 steps while the ±1σ variance shrinks — the policy is both improving and becoming more consistent.
- The SFT warm-start means format compliance is already high at step 1, so GRPO spends its budget improving R² rather than relearning JSON syntax.

---

## Repository Layout

```
physix-live/
├── physix/
│   ├── models.py              # Pydantic Action / Observation / State
│   ├── client.py              # OpenEnv WebSocket client
│   ├── systems/               # physical systems (3 trained, exposed via SUPPORTED_SYSTEMS)
│   │   ├── base.py            # PhysicalSystem ABC
│   │   ├── tier1.py           # FreeFall, SimplePendulum (+ extras for future work)
│   │   ├── tier2.py           # DampedSpring (+ extras for future work)
│   │   ├── tier3.py           # placeholders for future extensions, not exposed
│   │   └── registry.py
│   ├── verifier/
│   │   ├── parser.py          # SymPy whitelisted grammar
│   │   ├── simulator.py       # scipy.odeint forward simulation
│   │   ├── metrics.py         # per-step R²
│   │   ├── mismatch.py        # English residual summary
│   │   └── reward.py          # reward composition + hacking mitigations
│   ├── server/
│   │   ├── environment.py     # PhysiXEnvironment (OpenEnv subclass)
│   │   ├── interactive.py     # session-based REST router for the UI
│   │   └── app.py
│   └── training/
│       ├── prompt.py          # observation → prompt
│       ├── scorer.py          # cached single-completion scorer
│       ├── reward_fns.py      # TRL-compatible reward callables
│       ├── dataset.py         # GRPO dataset builder
│       ├── sft.py             # SFT warm-start
│       └── loop.py            # Unsloth + TRL GRPO loop
├── frontend/                  # React + TS + Tailwind demo UI
├── train/                     # HF Jobs launcher + Colab notebook
│   ├── submit.py              # submit job via HfApi.run_uv_job
│   ├── job_train.py           # multi-system driver (in-container)
│   ├── job_train_single.py    # single-system driver (in-container)
│   ├── physix_train_colab.ipynb  # SFT → GRPO end-to-end notebook
│   └── sync-plots.sh          # mirror plots from model repo
├── tests/                     # ~30 tests
├── docs/
│   ├── plots/                 # committed loss / reward / per-component PNGs
│   └── writeup.md
├── Dockerfile                 # env Space build (FastAPI + built React UI)
├── openenv.yaml               # OpenEnv manifest (name, runtime, app entrypoint)
└── pyproject.toml
```

---

## Quick Start

One command from the repo root:

```bash
make dev
```

This starts the FastAPI backend on `:8000` (deps auto-resolved by `uv`) and the Vite frontend on `:5173`. Open [http://localhost:5173](http://localhost:5173).

### Connecting an LLM

The demo speaks to **any OpenAI-compatible `/v1/chat/completions` endpoint** — local Ollama, Hugging Face Inference Providers, OpenAI, vLLM, OpenRouter, etc. The "Connect an LLM" panel exposes:

| Field | Purpose |
|-------|---------|
| **Endpoint** | Preset dropdown. Picks `base_url` + a default model id. |
| **Model** | Provider-native id (HF repo, Ollama tag, OpenAI name). Free-form. |
| **Custom base URL** | Shown when `Custom` is selected. Anything ending in `/v1`. |
| **API key** | Bearer token. Persisted per `base_url` in `localStorage`, never sent unless an episode runs. |

Server-side env-var fallback (lets a deployed Space ship a sensible default without leaking secrets in the bundle):

| URL family | Env var |
|---|---|
| `*huggingface*` | `HF_TOKEN`, then `HUGGINGFACE_API_KEY` |
| `*openai.com*` | `OPENAI_API_KEY` |
| `*openrouter*` | `OPENROUTER_API_KEY` |
| `localhost` / `127.0.0.1` | none (Ollama needs no key) |

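The demo UI issues these calls itself, but the contract is easy to exercise by hand. A sketch using the `openai` client against a local Ollama server (Ollama serves an OpenAI-compatible API under `/v1`; the model tag and prompt text here are placeholders, and the real prompt is built by `physix/training/prompt.py`):

```python
# Any /v1/chat/completions endpoint works; swap base_url / api_key for HF, OpenAI, vLLM, etc.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # Ollama ignores the key
resp = client.chat.completions.create(
    model="qwen2.5:3b-instruct",  # example Ollama tag; pull it first with `ollama pull`
    messages=[
        {"role": "system", "content": "Propose an ODE in the PhysiX grammar, answering as JSON."},
        {"role": "user", "content": "HINT: Mass on a spring ...\nTRAJECTORY:\nt=0.00  x= 0.40  dx= 0.00\n..."},
    ],
)
print(resp.choices[0].message.content)
```
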
### Side-by-side comparison

The default page is a **two-column comparison**: same trajectory, same hint, same seed, same verifier — two different models. The presets are wired to make the headline story self-evident:

- **A** = `Pratyush-01/physix-3b-rl` via HF Inference Providers (the GRPO-trained model)
- **B** = `Qwen/Qwen2.5-3B-Instruct` via HF Inference Providers (untrained baseline)

Drop in `gpt-4o-mini` on either side as a frontier reference, or swap to local Ollama for offline dev. The reward delta between the two columns is exactly what GRPO bought — no benchmark prose necessary.

> **For the trained model on HF Inference Providers**: weights are public, but the repo card needs `inference: true` and a serving provider (Featherless/Together/etc.) to have it loaded. If a visitor sees a 404 from the trained side, they can either bring up `ollama serve` locally and pull a quantised version, or fall back to `Qwen/Qwen2.5-3B-Instruct` on both sides.

### Programmatic use

```python
import asyncio
from physix import PhysiXEnv, PhysiXAction

async def main():
    async with PhysiXEnv(base_url="http://127.0.0.1:8000") as env:
        obs = await env.reset(system_id="damped_spring", seed=42)
        result = await env.step(
            PhysiXAction(
                equation="d2x/dt2 = -(k/m) * x - (b/m) * dx",
                params={"k": 4.0, "m": 1.0, "b": 0.36},
            )
        )
        print(result.observation.reward_breakdown)

asyncio.run(main())
```

Run the test suite:

```bash
pytest tests/
```

---

## License

MIT.