---
library_name: peft
pipeline_tag: text-generation
tags:
  - lora
  - peft
  - judge
  - video-evaluation
---

phyjudge-9B — Judge LoRA Adapter

LoRA adapter trained as a judge model that scores generated videos on three general axes (prompt alignment / SA, temporal / PTV, and persistence) and on 13 physical-law sub-rubrics. Released alongside the companion dataset NU-World-Model-Embodied-AI/phyground.

Paper: PhyGround: Benchmarking Physical Reasoning in Generative World Models (arXiv:2605.10806)

The base model identifier required to attach this adapter is recorded in adapter_config.json (base_model_name_or_path); the inference script reads it automatically.
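
If you want to check the recorded id yourself, PEFT can read it straight from the Hub. A minimal sketch (infer.py already does the equivalent internally, so this is optional):

from peft import PeftConfig

# Reads adapter_config.json from the Hub repo (or a local folder).
peft_cfg = PeftConfig.from_pretrained("NU-World-Model-Embodied-AI/phyjudge-9B")
print(peft_cfg.base_model_name_or_path)  # base model to load before attaching the LoRA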

Files

File                        Purpose
adapter_config.json         PEFT/LoRA config (records the base model id)
adapter_model.safetensors   LoRA weights (~167 MB)
additional_config.json      ms-swift extras (lora_dtype / lr ratios)
training_args.json          training hyperparameters
subq+human.yaml             prompt template used at training and inference time
infer.py                    standalone end-to-end inference script

Setup

pip install "transformers>=4.49" peft accelerate pyyaml \
            "qwen-vl-utils[decord]" huggingface_hub

Loading the base model in bf16 needs roughly 24 GB of GPU memory.

Quickstart — Hugging Face Hub

infer.py accepts either a local folder or a HF Hub repo id via --adapter-dir; the default value already points at this repo, so the following commands work without cloning anything.

# General axes (1–5 each): SA / PTV / persistence
python infer.py \
  --video /path/to/video.mp4 \
  --caption "A ball rolls down a ramp and knocks over a block." \
  --metric SA

# Physical-law axes (1–5 each): one of the 13 laws below
python infer.py \
  --video /path/to/video.mp4 \
  --caption "A ball rolls down a ramp and knocks over a block." \
  --law gravity

infer.py will:

  1. Resolve --adapter-dir to a local directory (huggingface_hub.snapshot_download if it is a Hub id; see the sketch after this list).
  2. Read adapter_config.json to find the base model and load it via transformers.
  3. Attach the LoRA adapter via PEFT.
  4. Render the scoring prompt from subq+human.yaml, plus the relevant sub-questions / per-law criterion (constants embedded in infer.py).
  5. Run greedy decoding with --max-new-tokens 64 (matches training).
  6. Parse the JSON object and print the integer score.
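
Step 1 can be reproduced in isolation. A minimal sketch, which may differ from the actual infer.py internals (resolve_adapter_dir is an illustrative name, not a function the script exports):

from pathlib import Path
from huggingface_hub import snapshot_download

def resolve_adapter_dir(adapter_dir: str) -> Path:
    """Return a local path, downloading a Hub snapshot when needed."""
    path = Path(adapter_dir)
    if path.is_dir():
        return path  # already a local checkout
    # Treat the value as a Hub repo id and fetch (or reuse) a cached snapshot.
    return Path(snapshot_download(repo_id=adapter_dir))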

Output is a single JSON line:

{"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}

--metric choices: SA, PTV, persistence. --law choices: gravity, inertia, momentum, impenetrability, collision, material, buoyancy, displacement, flow_dynamics, boundary_interaction, fluid_continuity, reflection, shadow.

Add --print-prompt to inspect the exact rendered system + user prompt before generation.

Programmatic use

from pathlib import Path
import torch

from infer import (
    build_messages,
    build_prompt,
    decode_generated,
    load_model,
    load_yaml,
    parse_score,
    prepare_inputs,
)

# Download the adapter (if needed), load the base model recorded in
# adapter_config.json, and attach the LoRA.
processor, model, adapter_dir = load_model(
    "NU-World-Model-Embodied-AI/phyjudge-9B",
    dtype=torch.bfloat16,
    device_map="auto",
)
cfg = load_yaml(adapter_dir / "subq+human.yaml")  # prompt templates

# Render the system/user prompt for one physical-law axis.
system, user, key = build_prompt(
    cfg,
    caption="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
)
messages = build_messages(system, user, Path("video.mp4"))
inputs = prepare_inputs(
    processor,
    messages,
    next(model.parameters()).device,
    fps=2.0,               # frames sampled per second
    max_pixels=360 * 640,  # per-frame pixel budget
)

# Greedy decoding with up to 64 new tokens, matching training.
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)

raw = decode_generated(processor, inputs, out)
print({"key": key, "score": parse_score(raw, key), "raw": raw})

Prompt templates

Both training and inference prompts are rendered from two sources:

  • subq+human.yaml — system prompt, the SA / PTV / persistence templates for the general axes, and the physical_template shared by all 13 physical-law axes (with {prompt}, {law}, {criteria}, {questions_block} placeholders). Use --print-prompt to dump the fully rendered system + user prompt, or render it by hand as in the sketch after this list.
  • infer.py — the per-axis sub-question lists (GENERAL_SUB_QUESTIONS, PHYSICAL_SUB_QUESTIONS) and per-law criteria (PHYSICAL_CRITERIA) that are spliced into the YAML templates. Override any criterion at inference time with --criteria "..." instead of editing the source.
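
A rough sketch of what build_prompt does for a physical-law axis. Only physical_template is a documented name; that it is a top-level YAML key is an assumption here, and the bracketed strings stand in for the constants defined in infer.py:

import yaml

with open("subq+human.yaml") as f:
    template_cfg = yaml.safe_load(f)

user_prompt = template_cfg["physical_template"].format(
    prompt="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
    criteria="<per-law criterion from PHYSICAL_CRITERIA>",        # placeholder
    questions_block="<sub-questions from PHYSICAL_SUB_QUESTIONS>",  # placeholder
)
print(user_prompt)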

The judge always replies with a single JSON object containing one key (the metric or law name) and an integer score in 1–5.
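
A defensive re-implementation of that reply contract, as a sketch (infer.py ships its own parse_score; parse_judge_reply is an illustrative name):

import json
import re

def parse_judge_reply(raw: str, key: str) -> int:
    """Pull the first JSON object out of the reply and validate the score."""
    match = re.search(r"\{.*?\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in reply: {raw!r}")
    score = int(json.loads(match.group(0))[key])
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} outside 1-5")
    return score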

Training summary

LoRA via PEFT (rank 32, α 64, dropout 0.05) over the language-tower linear layers, with the vision encoder frozen; bf16 plus gradient checkpointing; AdamW with lr = 1e-4 and a cosine schedule; 1.0 epoch / 294 steps on the subq+human split (automatically derived sub-question judgements plus human-rated samples). Full hyperparameters are in training_args.json and additional_config.json; the exact LoRA target regex and rank are in adapter_config.json. Framework: ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2.
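
For reference, the stated hyperparameters map onto a PEFT config roughly like this (illustrative only; the authoritative values, including the target-module regex, live in adapter_config.json):

from peft import LoraConfig

lora_cfg = LoraConfig(
    r=32,              # LoRA rank
    lora_alpha=64,     # scaling alpha
    lora_dropout=0.05,
    task_type="CAUSAL_LM",  # assumption; the task type is not stated in this card
)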

See the companion dataset NU-World-Model-Embodied-AI/phyground for prompts, physical-law tags, and example videos.

Citation

@misc{lin2026phygroundbenchmarkingphysicalreasoning,
      title={PhyGround: Benchmarking Physical Reasoning in Generative World Models},
      author={Juyi Lin and Arash Akbari and Yumei He and Lin Zhao and Haichao Zhang and Arman Akbari and Xingchen Xu and Zoe Y. Lu and Enfu Nan and Hokin Deng and Edmund Yeh and Sarah Ostadabbas and Yun Fu and Jennifer Dy and Pu Zhao and Yanzhi Wang},
      year={2026},
      eprint={2605.10806},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.10806},
}

License

The base model is released by its original authors under its own license; this LoRA adapter is released by NU-World-Model-Embodied-AI.