---
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- peft
- judge
- video-evaluation
---

# phyjudge-9B — Judge LoRA Adapter

LoRA adapter trained as a judge model that scores generated videos against
prompt-alignment, temporal, persistence, and 13 physical-law sub-rubrics.
Released alongside the companion dataset
[`NU-World-Model-Embodied-AI/phyground`](https://huggingface.co/datasets/NU-World-Model-Embodied-AI/phyground).

Paper: [PhyGround: Benchmarking Physical Reasoning in Generative World Models (arXiv:2605.10806)](https://arxiv.org/abs/2605.10806)

The base model identifier required to attach this adapter is recorded in
`adapter_config.json` (`base_model_name_or_path`); the inference script
reads it automatically.
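
To check which base checkpoint that is without loading anything heavy, the config can be read straight from the Hub. A minimal sketch using standard `peft` calls:

```python
from peft import PeftConfig

# Fetches adapter_config.json from the Hub (or a local folder) and exposes
# the recorded base checkpoint id.
peft_cfg = PeftConfig.from_pretrained("NU-World-Model-Embodied-AI/phyjudge-9B")
print(peft_cfg.base_model_name_or_path)
```
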
## Files

| File | Purpose |
| --- | --- |
| `adapter_config.json` | PEFT/LoRA config (records base model id) |
| `adapter_model.safetensors` | LoRA weights (~167 MB) |
| `additional_config.json` | ms-swift extras (lora_dtype / lr ratios) |
| `training_args.json` | training hyperparameters |
| `subq+human.yaml` | prompt templates used at training and inference time |
| `infer.py` | standalone end-to-end inference script |

## Setup

```bash
pip install "transformers>=4.49" peft accelerate pyyaml \
    "qwen-vl-utils[decord]" huggingface_hub
```

Loading the base model in bf16 needs roughly 24 GB of GPU memory.
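
A quick pre-flight check of that headroom, using nothing beyond plain `torch`:

```python
import torch

# The bf16 base model alone occupies on the order of 24 GB; confirm the
# first visible GPU has at least that much total memory.
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"cuda:0 total memory: {total_gb:.1f} GB")
else:
    print("no CUDA device visible")
```
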
## Quickstart — Hugging Face Hub

`infer.py` accepts either a local folder or an HF Hub repo id via
`--adapter-dir`; the default value already points at this repo, so the
following commands work without cloning anything.
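
The adapter directory itself (weights, config, `subq+human.yaml`) is resolved automatically, so the only file that has to be on disk is the script. One way to fetch just that file, assuming nothing beyond `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download infer.py from this repo into the current working directory.
hf_hub_download(
    repo_id="NU-World-Model-Embodied-AI/phyjudge-9B",
    filename="infer.py",
    local_dir=".",
)
```
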
```bash
# General axes (1–5 each): SA / PTV / persistence
python infer.py \
  --video /path/to/video.mp4 \
  --caption "A ball rolls down a ramp and knocks over a block." \
  --metric SA

# Physical-law axes (1–5 each): one of the 13 laws below
python infer.py \
  --video /path/to/video.mp4 \
  --caption "A ball rolls down a ramp and knocks over a block." \
  --law gravity
```

`infer.py` will:

1. Resolve `--adapter-dir` to a local directory (`huggingface_hub.snapshot_download`
   if it is a Hub id).
2. Read `adapter_config.json` to find the base model and load it via
   `transformers`.
3. Attach the LoRA adapter via PEFT (a standalone sketch of steps 1–3 follows
   this list).
4. Render the scoring prompt from `subq+human.yaml`, plus the relevant
   sub-questions / per-law criterion (constants embedded in `infer.py`).
5. Run greedy decoding with `--max-new-tokens 64` (matches training).
6. Parse the JSON object and print the integer score.
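
For reference, steps 1–3 boil down to roughly the following. `AutoModelForImageTextToText` is an assumption about the base checkpoint (use whichever class `base_model_name_or_path` actually requires); `load_model` in `infer.py` remains the authoritative path:

```python
import torch
from huggingface_hub import snapshot_download
from peft import PeftConfig, PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

# Step 1: resolve the adapter repo id to a local directory.
adapter_dir = snapshot_download("NU-World-Model-Embodied-AI/phyjudge-9B")

# Step 2: read the base model id recorded in adapter_config.json and load it.
base_id = PeftConfig.from_pretrained(adapter_dir).base_model_name_or_path
processor = AutoProcessor.from_pretrained(base_id)
base = AutoModelForImageTextToText.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Step 3: attach the LoRA weights on top of the frozen base.
model = PeftModel.from_pretrained(base, adapter_dir)
```
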
Output is a single JSON line:

```json
{"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}
```

- `--metric` choices: `SA`, `PTV`, `persistence`.
- `--law` choices: `gravity`, `inertia`, `momentum`, `impenetrability`,
  `collision`, `material`, `buoyancy`, `displacement`, `flow_dynamics`,
  `boundary_interaction`, `fluid_continuity`, `reflection`, `shadow`.

Add `--print-prompt` to inspect the exact rendered system + user prompt
before generation.

## Programmatic use

```python
from pathlib import Path
import torch

from infer import (
    build_messages,
    build_prompt,
    decode_generated,
    load_model,
    load_yaml,
    parse_score,
    prepare_inputs,
)

processor, model, adapter_dir = load_model(
    "NU-World-Model-Embodied-AI/phyjudge-9B",
    dtype=torch.bfloat16,
    device_map="auto",
)
cfg = load_yaml(adapter_dir / "subq+human.yaml")

system, user, key = build_prompt(
    cfg,
    caption="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
)
messages = build_messages(system, user, Path("video.mp4"))
inputs = prepare_inputs(
    processor,
    messages,
    next(model.parameters()).device,
    fps=2.0,
    max_pixels=360 * 640,
)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)

raw = decode_generated(processor, inputs, out)
print({"key": key, "score": parse_score(raw, key), "raw": raw})
```
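
Scoring one clip on every physical-law axis just loops the same calls. This continues from the snippet above (`processor`, `model`, and `cfg` already loaded); the law names are the ones accepted by `--law`:

```python
from pathlib import Path

caption = "A ball rolls down a ramp and knocks over a block."
video = Path("video.mp4")

laws = [
    "gravity", "inertia", "momentum", "impenetrability", "collision",
    "material", "buoyancy", "displacement", "flow_dynamics",
    "boundary_interaction", "fluid_continuity", "reflection", "shadow",
]

scores = {}
for law in laws:
    system, user, key = build_prompt(cfg, caption=caption, law=law)
    messages = build_messages(system, user, video)
    inputs = prepare_inputs(
        processor, messages, next(model.parameters()).device,
        fps=2.0, max_pixels=360 * 640,
    )
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    raw = decode_generated(processor, inputs, out)
    scores[key] = parse_score(raw, key)

print(scores)
```
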
## Prompt templates

Both training and inference prompts are rendered from two sources:

- `subq+human.yaml` — system prompt, the SA / PTV / persistence templates
  for the general axes, and the `physical_template` shared by all 13
  physical-law axes (with `{prompt}`, `{law}`, `{criteria}`,
  `{questions_block}` placeholders). Use `--print-prompt` to dump the
  fully rendered system + user prompt.
- `infer.py` — the per-axis sub-question lists (`GENERAL_SUB_QUESTIONS`,
  `PHYSICAL_SUB_QUESTIONS`) and per-law criteria (`PHYSICAL_CRITERIA`)
  that are spliced into the YAML templates (see the sketch after this
  list). Override any criterion at inference time with `--criteria "..."`
  instead of editing the source.
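
A rough picture of how those pieces fit together, assuming `load_yaml` returns a plain dict and that `PHYSICAL_CRITERIA` / `PHYSICAL_SUB_QUESTIONS` are keyed by law name; the real rendering lives in `build_prompt`:

```python
from pathlib import Path

from huggingface_hub import snapshot_download
from infer import PHYSICAL_CRITERIA, PHYSICAL_SUB_QUESTIONS, load_yaml

adapter_dir = Path(snapshot_download("NU-World-Model-Embodied-AI/phyjudge-9B"))
cfg = load_yaml(adapter_dir / "subq+human.yaml")

# Fill the physical_template placeholders listed above; str.format-style
# substitution is an assumption here, build_prompt is authoritative.
user_prompt = cfg["physical_template"].format(
    prompt="A ball rolls down a ramp and knocks over a block.",
    law="gravity",
    criteria=PHYSICAL_CRITERIA["gravity"],
    questions_block="\n".join(PHYSICAL_SUB_QUESTIONS["gravity"]),
)
print(user_prompt)
```
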
The judge always replies with a single JSON object containing one key
(the metric or law name) and an integer score in 1–5.
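
A minimal sketch of consuming such a reply (the helper name is illustrative; `parse_score` in `infer.py` is the reference implementation):

```python
import json

def read_judge_score(raw: str, key: str) -> int:
    """Parse a one-key JSON reply such as {"gravity": 4} into an int score."""
    score = int(json.loads(raw)[key])
    if not 1 <= score <= 5:
        raise ValueError(f"score outside the expected 1-5 range: {score}")
    return score

print(read_judge_score('{"gravity": 4}', "gravity"))  # -> 4
```
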
## Training summary

LoRA via PEFT (rank 32, α 64, dropout 0.05) over the language-tower
linear layers, with the vision encoder frozen; bf16 + gradient
checkpointing; AdamW at lr 1e-4 with a cosine schedule; 1.0 epoch /
294 steps on the `subq+human` split (automatically derived sub-question
judgements + human-rated samples). Full hyperparameters are in
`training_args.json` and `additional_config.json`; the exact LoRA target
regex and rank are in `adapter_config.json`. Framework: ms-swift 4.1.2,
PEFT 0.19.1, DeepSpeed ZeRO-2.
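
For reproduction, the summary above corresponds to roughly this PEFT config; the `target_modules` regex below is illustrative only, the exact trained pattern is stored in `adapter_config.json`:

```python
from peft import LoraConfig

# Rank / alpha / dropout from the summary above; target_modules is a
# placeholder regex for the language-tower linear layers, not the exact
# pattern used in training.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=r".*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)$",
    task_type="CAUSAL_LM",
)
```
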
See the companion dataset
[`NU-World-Model-Embodied-AI/phyground`](https://huggingface.co/datasets/NU-World-Model-Embodied-AI/phyground)
for prompts, physical-law tags, and example videos.

## Citation

```bibtex
@misc{lin2026phygroundbenchmarkingphysicalreasoning,
      title={PhyGround: Benchmarking Physical Reasoning in Generative World Models},
      author={Juyi Lin and Arash Akbari and Yumei He and Lin Zhao and Haichao Zhang and Arman Akbari and Xingchen Xu and Zoe Y. Lu and Enfu Nan and Hokin Deng and Edmund Yeh and Sarah Ostadabbas and Yun Fu and Jennifer Dy and Pu Zhao and Yanzhi Wang},
      year={2026},
      eprint={2605.10806},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.10806},
}
```
## License

The base model is released by its original authors; the LoRA adapter
weights in this repository are released by NU-World-Model-Embodied-AI.