---
title: PhysiX
emoji: ⚛️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Equation discovery from noisy trajectories (RLVR)
tags:
  - openenv
  - rlvr
  - physics
  - equation-discovery
  - ode
base_path: /web
---
# PhysiX — Equation Discovery via RLVR

An [OpenEnv](https://github.com/openenv-hackathon/openenv) hackathon submission (Apr 2026).

Given a noisy trajectory and a one-sentence hint, a language model iteratively proposes and refines an ODE that reproduces the observed motion. Reward comes entirely from `scipy.integrate.odeint` + per-step R² — no LLM-as-judge.

---

## Links

| | |
|---|---|
| **Live demo (HF Space)** | https://huggingface.co/spaces/Pratyush-01/physix-live |
| **Trained model** | https://huggingface.co/Pratyush-01/physix-3b-rl |
| **Colab training notebook** | https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/train/physix_train_colab.ipynb |
| **W&B training runs** | https://wandb.ai/pratyush01/physix-live |
| **Blog post / writeup** | https://huggingface.co/spaces/Pratyush-01/physix-live/blob/main/docs/blog.md |
| **Checkpoint repo** | https://huggingface.co/Pratyush-01/physix-3b-rl-ckpt |

---

## Environment

### Physical Systems — 3 Tiers

6 training systems across 3 difficulty tiers, plus 2 held-out Tier 3 systems.

| Tier | System | Ground-truth equation | Notes |
|------|--------|-----------------------|-------|
| 1 | Free Fall | `d2y/dt2 = -g` | 1 parameter |
| 1 | Free Fall with Drag | `d2y/dt2 = -g + k*vy**2` | nonlinear drag |
| 1 | Simple Pendulum | `d2theta/dt2 = -(g/L)*sin(theta)` | transcendental |
| 2 | Damped Pendulum | `d2theta/dt2 = -(g/L)*sin(theta) - b*dtheta` | 3 parameters |
| 2 | Spring-Mass | `d2x/dt2 = -(k/m)*x` | parameter ratio |
| 2 | Damped Spring | `d2x/dt2 = -(k/m)*x - (b/m)*dx` | damped oscillation |
| 3 *(held-out)* | Projectile with Drag | 2-D coupled ODE | out-of-distribution |
| 3 *(held-out)* | Charged Particle in B-field | 2-D Lorentz force | cross-product coupling |

Parameters and initial conditions are randomised per episode.
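As a concrete illustration, the free-fall-with-drag row corresponds to an `odeint`-style state derivative like the sketch below. This is not the repo's code (the real system classes live in `physix/systems/tier1.py`); it only shows what one ground-truth equation looks like as a simulable function.

```python
def free_fall_drag_rhs(state, t, g, k):
    """State derivative for d2y/dt2 = -g + k*vy**2, in the (state, t, *params)
    shape that scipy.integrate.odeint expects.

    Illustrative sketch only. On descent vy < 0, so +k*vy**2 opposes gravity
    and the system approaches a terminal velocity.
    """
    y, vy = state
    return [vy, -g + k * vy ** 2]
```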
### Example Task

```
HINT: Object dropped from altitude 58.3 m, mass 1.8 kg, in air.
Air resistance may be non-negligible.

TRAJECTORY (t, y, vy):
t=0.00  y=58.30   vy=  0.00
t=0.50  y=56.89   vy= -5.44
t=1.00  y=51.91   vy= -9.21
t=2.00  y=35.70   vy=-13.88
t=3.00  y=16.11   vy=-16.02
t=5.00  y=-20.42  vy=-16.49   ← terminal velocity visible

STATS: mean_vy=-10.85 std_vy=6.41 min_vy=-16.49
```

Target output:

```json
{
  "equation": "d2y/dt2 = -g + k * vy**2",
  "params": {"g": 9.81, "k": 0.047},
  "rationale": "Quadratic drag; vy² is positive regardless of sign of vy."
}
```

**Grammar:** operators `+ - * / **`, functions `sin cos tan exp log sqrt abs`, declared state variables and parameter names. Anything outside this scores `format = 0`.
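The whitelist idea can be approximated with the stdlib `ast` module. This is a stand-in, not the project's parser (the real verifier uses a SymPy-based grammar in `physix/verifier/parser.py`); it only demonstrates rejecting anything outside the allowed operators, functions, and declared names.

```python
import ast

# Whitelist from the grammar above.
ALLOWED_FUNCS = {"sin", "cos", "tan", "exp", "log", "sqrt", "abs"}
ALLOWED_NODES = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Call, ast.Name,
                 ast.Constant, ast.Load,
                 ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Pow, ast.USub, ast.UAdd)


def check_rhs(rhs: str, declared: set[str]) -> bool:
    """True iff the ODE right-hand side stays inside the grammar:
    whitelisted operators and functions, declared names only."""
    try:
        tree = ast.parse(rhs, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False            # attributes, comparisons, %, etc.
        if isinstance(node, ast.Call) and not (
                isinstance(node.func, ast.Name) and node.func.id in ALLOWED_FUNCS):
            return False            # only whitelisted function calls
        if isinstance(node, ast.Name) and node.id not in declared | ALLOWED_FUNCS:
            return False            # undeclared symbol => format = 0
    return True
```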
### Episode Flow

```mermaid
sequenceDiagram
    participant Agent
    participant Env as PhysiXEnvironment
    participant Sim as scipy.odeint
    participant Verifier
    Env->>Agent: reset() → trajectory + hint
    loop up to 8 turns
        Agent->>Env: step(equation + params + rationale)
        Env->>Sim: simulate hypothesis from t=0
        Sim-->>Verifier: predicted trajectory
        Verifier-->>Env: r_match + r_progress + r_simplicity + r_format
        Env->>Agent: mismatch summary + reward breakdown + history
        alt r_match > 0.93 or budget exhausted
            Env-->>Agent: done=True
        end
    end
```

After each step the agent receives an English mismatch summary (e.g. *"predicted vy diverges after t=2 s; residual consistently negative"*) alongside the numeric reward breakdown, so it has something to act on in the next turn.
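In code, the flow above reduces to a short driver loop. This is a hypothetical sketch: `env` is any object with async `reset()`/`step()` in the style of the client under "Programmatic use", and `propose` stands in for the LLM call that turns the latest observation into an action.

```python
async def run_episode(env, propose, max_turns: int = 8):
    """Drive one episode against the sequence diagram above.

    Hypothetical driver: `propose(obs)` is a callable producing the next
    equation+params action; the real client API may differ.
    """
    obs = await env.reset()
    for _ in range(max_turns):                  # turn budget from the diagram
        result = await env.step(propose(obs))   # equation + params + rationale
        obs = result.observation                # mismatch summary + rewards
        if result.done:                         # r_match > 0.93 or budget spent
            break
    return obs
```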
---

## Reward

All reward is computed from `scipy.odeint` output — no model-in-the-loop scoring.
### Step-wise (live env + GRPO)

| Component | Weight | Formula | Purpose |
|-----------|:------:|---------|---------|
| `match` | 0.50 | R² (observed vs. predicted) | primary accuracy signal |
| `progress` | 0.20 | `max(0, r_match − r_match_prev)` | per-turn improvement shaping |
| `simplicity` | 0.20 | `1 − (operator_count / 12)` | prefer shorter equations |
| `format` | 0.10 | 1 if parsed **and** simulated successfully | syntactic + numerical validity |
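The `match` component is ordinary R² over the trajectory. A minimal stand-in (the real implementation is `physix/verifier/metrics.py`, which may clamp or weight differently):

```python
def r_squared(observed: list[float], predicted: list[float]) -> float:
    """Coefficient of determination between observed and simulated values.
    Sketch of the per-step R² used as the `match` reward component."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return (1.0 - ss_res / ss_tot) if ss_tot > 0 else 0.0
```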
### GRPO-only additions

Two extra signals are added during training but not used in the live env:

- **`match_dense = sqrt(R²)`** — gives a non-trivial gradient when raw R² is near zero (e.g. `sqrt(0.05) ≈ 0.22`).
- **`correctness` = 1 if R² ≥ 0.70, else 0** — a binary bonus that helps push past R² plateaus where the dense signal flattens.
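Both additions are a few lines on top of the raw R². A sketch, with the clamp as an assumption (R² can go negative for fits worse than the mean predictor, and `sqrt` needs a non-negative input):

```python
import math


def grpo_extra_rewards(r2: float) -> dict[str, float]:
    """GRPO-only shaping terms; thresholds from the bullets above.
    Illustrative sketch, not the repo's reward_fns.py."""
    r2 = max(0.0, r2)  # assumed clamp: R² < 0 means "worse than the mean"
    return {
        "match_dense": math.sqrt(r2),            # lifts near-zero R² gradients
        "correctness": 1.0 if r2 >= 0.70 else 0.0,
    }
```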
### Reward-hacking mitigations

Three failure modes found during development, and how they were closed:

**1. Parse-but-crash exploit.**
A valid-but-explosive equation (e.g. `d2y/dt2 = exp(vy**10)`) parses but makes `odeint` produce NaN. Without a fix, it still earns `format = 1`.
→ `format = 1` only if integration completes without NaN/inf.

**2. Trivial-equation exploit.**
`d2y/dt2 = 0` has zero operators, so `simplicity = 1`, earning 20% of the step reward for a completely wrong trajectory.
→ `simplicity = 0` unless `r_match ≥ 0.10`.

**3. Progress signal in single-turn GRPO.**
Every GRPO training row starts with `previous_r_match = 0`, so `progress = r_match` — a redundant copy of the match signal that dilutes advantage estimates.
→ `progress` is excluded from the GRPO reward function set; it is only used in multi-turn live episodes.
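Putting the weight table and the three gates together gives a composition along these lines. This is an illustrative sketch, not `physix/verifier/reward.py` itself; function and argument names are assumptions.

```python
import math

# Weights from the step-wise reward table.
WEIGHTS = {"match": 0.50, "progress": 0.20, "simplicity": 0.20, "format": 0.10}


def step_reward(predicted: list[float], r_match: float, r_match_prev: float,
                operator_count: int, multi_turn: bool) -> float:
    """Weighted step reward with the three anti-hacking gates (sketch)."""
    # Gate 1: format pays out only if the simulated trajectory stayed finite.
    r_format = 1.0 if all(math.isfinite(v) for v in predicted) else 0.0
    # Gate 2: simplicity is withheld until the fit is minimally plausible.
    r_simplicity = (1.0 - operator_count / 12) if r_match >= 0.10 else 0.0
    # Gate 3: progress exists only in multi-turn live episodes, not in GRPO.
    r_progress = max(0.0, r_match - r_match_prev) if multi_turn else 0.0
    return (WEIGHTS["match"] * max(0.0, r_match)
            + WEIGHTS["progress"] * r_progress
            + WEIGHTS["simplicity"] * r_simplicity
            + WEIGHTS["format"] * r_format)
```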
---

## Training: SFT → GRPO

### Why SFT first

GRPO relies on reward variance across rollouts to estimate advantages. With a cold base model, ~80% of completions are unparseable (LaTeX, prose, malformed JSON) and most parseable ones crash the integrator, leaving near-zero variance and no useful gradient. The model needs to produce the right output format before RL can do anything meaningful with the physics signal.

SFT runs for 3 epochs on synthetic `(prompt, ground_truth_equation)` pairs generated from the environment. After SFT:

- >90% of completions parse and simulate successfully (up from ~20%).
- Equations are in the ASCII ODE grammar the verifier expects.
- The model has seen the right equation family for each system at least once.

SFT only establishes format. Parameter values are still wrong — that is what GRPO refines.

### Step 1 — SFT warm-start

```bash
python -m physix.training.sft \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-sft \
  --epochs 3 \
  --lora-r 32 \
  --instances-per-system 32 \
  --system-ids damped_spring
```

Runtime: ~5 min on an L40S.

### Step 2 — GRPO

```bash
python -m physix.training.loop \
  --model Qwen/Qwen2.5-3B-Instruct \
  --output-dir runs/physix-3b-rl \
  --num-steps 200 \
  --num-generations 4 \
  --lora-r 32 \
  --sft-checkpoint runs/physix-3b-sft/merged \
  --system-ids damped_spring \
  --push-to-hub \
  --hub-repo-id Pratyush-01/physix-3b-rl
```

Runtime: ~45 min on an L40S.

### Full cloud job

```bash
hf jobs uv run train/job_train_single.py \
  --image unsloth/unsloth:2026.3.8-pt2.9.0-vllm-0.16.0-cu12.8-studio-release \
  --flavor l40sx1 \
  --secrets HF_TOKEN \
  --secrets WANDB_API_KEY \
  -v hf://datasets/Pratyush-01/physix-live-src:/physix-live \
  --timeout 2h
```
---

## Training Results

| GRPO Loss (↓) | Total Reward (↑) |
|:---:|:---:|
|  |  |

| Per-component reward breakdown |
|:---:|
|  |

W&B runs: [pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live)

Key observations from the run:

- `reward/format` jumps to ~0.9 in the first 10 steps and holds there — the SFT warm-start does its job.
- `reward/match_dense` (√R²) and `reward/correctness` both trend from ~0.6 to ~0.9–1.0 over 200 steps — the physics simulation is driving the gradient.
- `reward/match` (raw R²) tracks `match_dense` and converges to ~0.95+ by step 150, indicating the model reliably proposes equations that fit the trajectory.
- `reward/simplicity` rises gradually; the gate at R² ≥ 0.10 prevents trivial-equation gaming (`d2y/dt2 = 0` can't earn simplicity reward).
- Total mean reward rises from ~3.3 to ~4.8 (+45%) with ±1σ variance shrinking — the policy is both improving and becoming more consistent.

What we don't claim:

- The model generalises well to Tier 3 systems without further training.
- A 3B model is competitive with frontier models on this task.
- The reward improvements translate to meaningful physics understanding beyond the training distribution.
---

## Repository Layout

```
physix-live/
├── physix/
│   ├── models.py             # Pydantic Action / Observation / State
│   ├── client.py             # OpenEnv WebSocket client
│   ├── systems/              # physical systems
│   │   ├── base.py           # PhysicalSystem ABC
│   │   ├── tier1.py          # FreeFall, FreeFallWithDrag, SimplePendulum
│   │   ├── tier2.py          # DampedPendulum, SpringMass, DampedSpring
│   │   ├── tier3.py          # ProjectileWithDrag, ChargedInBField (held out)
│   │   └── registry.py
│   ├── verifier/
│   │   ├── parser.py         # SymPy whitelisted grammar
│   │   ├── simulator.py      # scipy.odeint forward simulation
│   │   ├── metrics.py        # per-step R²
│   │   ├── mismatch.py       # English residual summary
│   │   └── reward.py         # reward composition + hacking mitigations
│   ├── server/
│   │   ├── environment.py    # PhysiXEnvironment (OpenEnv subclass)
│   │   ├── interactive.py    # session-based REST router for the UI
│   │   └── app.py
│   └── training/
│       ├── prompt.py         # observation → prompt
│       ├── scorer.py         # cached single-completion scorer
│       ├── reward_fns.py     # TRL-compatible reward callables
│       ├── dataset.py        # GRPO dataset builder
│       ├── sft.py            # SFT warm-start
│       └── loop.py           # Unsloth + TRL GRPO loop
├── frontend/                 # React + TS + Tailwind demo UI
├── train/                    # HF Jobs launcher + Colab notebook
│   ├── submit.py             # submit job via HfApi.run_uv_job
│   ├── job_train.py          # multi-system driver (in-container)
│   ├── job_train_single.py   # single-system driver (in-container)
│   ├── physix_train_colab.ipynb  # SFT → GRPO end-to-end notebook
│   └── sync-plots.sh         # mirror plots from model repo
├── tests/                    # ~30 tests
├── docs/
│   ├── plots/                # committed loss / reward / per-component PNGs
│   └── writeup.md
├── Dockerfile                # env Space build (FastAPI + built React UI)
├── openenv.yaml              # OpenEnv manifest (name, runtime, app entrypoint)
└── pyproject.toml
```
---

## Quick Start

One command from the repo root:

```bash
make dev
```

This starts the FastAPI backend on `:8000` (deps auto-resolved by `uv`) and the Vite frontend on `:5173`. Open [http://localhost:5173](http://localhost:5173).

### Connecting an LLM

The demo speaks to **any OpenAI-compatible `/v1/chat/completions` endpoint** — local Ollama, Hugging Face Inference Providers, OpenAI, vLLM, OpenRouter, etc. The "Connect an LLM" panel exposes:

| Field | Purpose |
|-------|---------|
| **Endpoint** | Preset dropdown. Picks `base_url` + a default model id. |
| **Model** | Provider-native id (HF repo, Ollama tag, OpenAI name). Free-form. |
| **Custom base URL** | Shown when `Custom` is selected. Anything ending in `/v1`. |
| **API key** | Bearer token. Persisted per `base_url` in `localStorage`, never sent unless an episode runs. |

Server-side env-var fallback (lets a deployed Space ship a sensible default without leaking secrets in the bundle):

| URL family | Env var |
|---|---|
| `*huggingface*` | `HF_TOKEN`, then `HUGGINGFACE_API_KEY` |
| `*openai.com*` | `OPENAI_API_KEY` |
| `*openrouter*` | `OPENROUTER_API_KEY` |
| `localhost` / `127.0.0.1` | none (Ollama needs no key) |
### Side-by-side comparison

The default page is a **two-column comparison**: same trajectory, same hint, same seed, same verifier — two different models. The presets are wired to make the headline story self-evident:

- **A** = `Pratyush-01/physix-3b-rl` via HF Inference Providers (the GRPO-trained model)
- **B** = `Qwen/Qwen2.5-3B-Instruct` via HF Inference Providers (untrained baseline)

Drop in `gpt-4o-mini` on either side as a frontier reference, or swap to local Ollama for offline dev. The reward delta between the two columns is exactly what GRPO bought — no benchmark prose necessary.

> **For the trained model on HF Inference Providers**: weights are public, but the repo card needs `inference: true` and a serving provider (Featherless/Together/etc.) to have it loaded. If a visitor sees a 404 from the trained side, they can either bring up `ollama serve` locally and pull a quantised version, or fall back to `Qwen/Qwen2.5-3B-Instruct` on both sides.

### Programmatic use

```python
import asyncio

from physix import PhysiXAction, PhysiXEnv


async def main():
    async with PhysiXEnv(base_url="http://127.0.0.1:8000") as env:
        obs = await env.reset(system_id="free_fall_drag", seed=42)
        result = await env.step(
            PhysiXAction(equation="d2y/dt2 = -g + k * vy**2", params={"g": 9.81, "k": 0.05})
        )
        print(result.observation.reward_breakdown)


asyncio.run(main())
```

Run the test suite with:

```bash
pytest tests/
```

---

## License

MIT.