juyil committed
Commit 4aee60a (verified) · 1 Parent(s): bd06ea3

Initial release: phyjudge-9B LoRA judge adapter

README.md ADDED
@@ -0,0 +1,166 @@
+ ---
+ library_name: peft
+ pipeline_tag: text-generation
+ tags:
+ - lora
+ - peft
+ - judge
+ - video-evaluation
+ ---
+
+ # phyjudge-9B — Judge LoRA Adapter
+
+ LoRA adapter trained as a judge model that scores generated videos on
+ prompt alignment (SA), temporal coherence (PTV), object persistence, and
+ 13 physical-law sub-rubrics. Released alongside the companion dataset
+ [`NU-World-Model-Embodied-AI/phyground`](https://huggingface.co/datasets/NU-World-Model-Embodied-AI/phyground).
+
+ The base model identifier required to attach this adapter is recorded in
+ `adapter_config.json` (`base_model_name_or_path`); the inference script
+ reads it automatically.
+
+ ## Files
+
+ | File | Purpose |
+ | --- | --- |
+ | `adapter_config.json` | PEFT/LoRA config (records the base model id) |
+ | `adapter_model.safetensors` | LoRA weights (~167 MB) |
+ | `additional_config.json` | ms-swift extras (lora_dtype / lr ratios) |
+ | `training_args.json` | Sanitized training hyperparameters |
+ | `subq+human.yaml` | Prompt template used at training and inference time |
+ | `infer.py` | Standalone end-to-end inference script |
+
+ ## Setup
+
+ ```bash
+ pip install "transformers>=4.49" peft accelerate pyyaml \
+   "qwen-vl-utils[decord]" huggingface_hub
+ ```
+
+ Loading the base model in bf16 needs roughly 24 GB of GPU memory.
+
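+ If you want the adapter files cached up front (for example on a machine that
+ will later run offline), you can pre-fetch them with `huggingface_hub`. This
+ is a minimal sketch of the same `snapshot_download` call that `infer.py`
+ performs automatically when `--adapter-dir` is a Hub id:
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Downloads adapter_config.json, adapter_model.safetensors, subq+human.yaml,
+ # infer.py, etc. into the local HF cache and returns the cached directory.
+ local_dir = snapshot_download(repo_id="NU-World-Model-Embodied-AI/phyjudge-9B")
+ print(local_dir)  # pass this path to --adapter-dir for fully local runs
+ ```
+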
+ ## Quickstart — Hugging Face Hub
+
+ `infer.py` accepts either a local folder or an HF Hub repo id via
+ `--adapter-dir`; the default value already points at this repo, so the
+ following commands work without cloning anything.
+
+ ```bash
+ # General axes (1–5 each): SA / PTV / persistence
+ python infer.py \
+   --video /path/to/video.mp4 \
+   --caption "A ball rolls down a ramp and knocks over a block." \
+   --metric SA
+
+ # Physical-law axes (1–5 each): one of the 13 laws below
+ python infer.py \
+   --video /path/to/video.mp4 \
+   --caption "A ball rolls down a ramp and knocks over a block." \
+   --law gravity
+ ```
+
+ `infer.py` will:
+
+ 1. Resolve `--adapter-dir` to a local directory (`huggingface_hub.snapshot_download`
+    if it is a Hub id).
+ 2. Read `adapter_config.json` to find the base model and load it via
+    `transformers`.
+ 3. Attach the LoRA adapter via PEFT.
+ 4. Render the scoring prompt from `subq+human.yaml`, plus the relevant
+    sub-questions / per-law criterion (constants embedded in `infer.py`).
+ 5. Run greedy decoding with `--max-new-tokens 64` (matches training).
+ 6. Parse the JSON object and print the integer score.
+
+ Output is a single JSON object:
+
+ ```json
+ {"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}
+ ```
+
+ `--metric` choices: `SA`, `PTV`, `persistence`.
+ `--law` choices: `gravity`, `inertia`, `momentum`, `impenetrability`,
+ `collision`, `material`, `buoyancy`, `displacement`, `flow_dynamics`,
+ `boundary_interaction`, `fluid_continuity`, `reflection`, `shadow`.
+
+ Add `--print-prompt` to inspect the exact rendered system + user prompt
+ before generation.
+
+ ## Programmatic use
+
+ ```python
+ from pathlib import Path
+ import torch
+
+ from infer import (
+     build_messages,
+     build_prompt,
+     decode_generated,
+     load_model,
+     load_yaml,
+     parse_score,
+     prepare_inputs,
+ )
+
+ processor, model, adapter_dir = load_model(
+     "NU-World-Model-Embodied-AI/phyjudge-9B",
+     dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ cfg = load_yaml(adapter_dir / "subq+human.yaml")
+
+ system, user, key = build_prompt(
+     cfg,
+     caption="A ball rolls down a ramp and knocks over a block.",
+     law="gravity",
+ )
+ messages = build_messages(system, user, Path("video.mp4"))
+ inputs = prepare_inputs(
+     processor,
+     messages,
+     next(model.parameters()).device,
+     fps=2.0,
+     max_pixels=360 * 640,
+ )
+
+ with torch.inference_mode():
+     out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+
+ raw = decode_generated(processor, inputs, out)
+ print({"key": key, "score": parse_score(raw, key), "raw": raw})
+ ```
+
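+ To score one video on every axis in a single pass, you can reuse the loaded
+ model and loop over the metric and law names exported by `infer.py`. The
+ sketch below is an illustration built on the helpers above, not part of the
+ released script:
+
+ ```python
+ from pathlib import Path
+ import torch
+
+ from infer import (
+     GENERAL_SUB_QUESTIONS,
+     PHYSICAL_CRITERIA,
+     build_messages,
+     build_prompt,
+     decode_generated,
+     load_model,
+     load_yaml,
+     parse_score,
+     prepare_inputs,
+ )
+
+ processor, model, adapter_dir = load_model(
+     "NU-World-Model-Embodied-AI/phyjudge-9B",
+     dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ cfg = load_yaml(adapter_dir / "subq+human.yaml")
+ device = next(model.parameters()).device
+ video = Path("video.mp4")
+ caption = "A ball rolls down a ramp and knocks over a block."
+
+ # General axes first (SA / PTV / persistence), then the 13 physical laws.
+ axes = [("metric", name) for name in GENERAL_SUB_QUESTIONS]
+ axes += [("law", name) for name in PHYSICAL_CRITERIA]
+
+ scores = {}
+ for kind, name in axes:
+     system, user, key = build_prompt(cfg, caption, **{kind: name})
+     messages = build_messages(system, user, video)
+     inputs = prepare_inputs(processor, messages, device, fps=2.0, max_pixels=360 * 640)
+     with torch.inference_mode():
+         out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+     scores[key] = parse_score(decode_generated(processor, inputs, out), key)
+
+ print(scores)  # {"SA": ..., "PTV": ..., "persistence": ..., "gravity": ..., ...}
+ ```
+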
+ ## Prompt templates
+
+ Both training and inference prompts are rendered from two sources:
+
+ - `subq+human.yaml` — system prompt, the SA / PTV / persistence templates
+   for the general axes, and the `physical_template` shared by all 13
+   physical-law axes (with `{prompt}`, `{law}`, `{criteria}`,
+   `{questions_block}` placeholders). Use `--print-prompt` to dump the
+   fully rendered system + user prompt.
+ - `infer.py` — the per-axis sub-question lists (`GENERAL_SUB_QUESTIONS`,
+   `PHYSICAL_SUB_QUESTIONS`) and per-law criteria (`PHYSICAL_CRITERIA`)
+   that are spliced into the YAML templates. Override any criterion at
+   inference time with `--criteria "..."` instead of editing the source.
+
+ The judge always replies with a single JSON object containing one key
+ (the metric or law name) and an integer score in 1–5.
+
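+ To inspect a rendered prompt without loading the model (the programmatic
+ counterpart of `--print-prompt`), a small sketch using the helpers from
+ `infer.py`; the path below assumes a local copy of `subq+human.yaml`, for
+ example inside the directory returned by `snapshot_download`:
+
+ ```python
+ from pathlib import Path
+
+ from infer import build_prompt, load_yaml
+
+ cfg = load_yaml(Path("subq+human.yaml"))
+ system, user, key = build_prompt(
+     cfg,
+     caption="A ball rolls down a ramp and knocks over a block.",
+     law="gravity",
+ )
+ # The gravity criterion and sub-questions are spliced into physical_template.
+ print("SYSTEM:\n" + system)
+ print("\nUSER:\n" + user)
+ ```
+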
+ ## Training summary
+
+ LoRA via PEFT (rank 32, α 64, dropout 0.05) over the language-tower
+ linear layers, with the vision encoder frozen; bf16 plus gradient
+ checkpointing; AdamW with lr 1e-4 and a cosine schedule; 1.0 epoch /
+ 294 steps on the `subq+human` split (automatically derived sub-question
+ judgements plus human-rated samples). The effective global batch size is
+ 32 (1 per device × 8 gradient-accumulation steps × 4 GPUs). Full
+ hyperparameters are in `training_args.json` and `additional_config.json`;
+ the exact LoRA target regex and rank are in `adapter_config.json`.
+ Framework: ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2.
+
+ See the companion dataset
+ [`NU-World-Model-Embodied-AI/phyground`](https://huggingface.co/datasets/NU-World-Model-Embodied-AI/phyground)
+ for prompts, physical-law tags, and example videos.
+
+ ## License
+
+ The base model is released by its original authors; this LoRA adapter
+ is released by NU-World-Model-Embodied-AI.
adapter_config.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "alora_invocation_tokens": null,
3
+ "alpha_pattern": {},
4
+ "arrow_config": null,
5
+ "auto_mapping": null,
6
+ "base_model_name_or_path": "Qwen/Qwen3.5-9B",
7
+ "bias": "none",
8
+ "corda_config": null,
9
+ "ensure_weight_tying": false,
10
+ "eva_config": null,
11
+ "exclude_modules": null,
12
+ "fan_in_fan_out": false,
13
+ "inference_mode": true,
14
+ "init_lora_weights": true,
15
+ "layer_replication": null,
16
+ "layers_pattern": null,
17
+ "layers_to_transform": null,
18
+ "loftq_config": {},
19
+ "lora_alpha": 64,
20
+ "lora_bias": false,
21
+ "lora_dropout": 0.05,
22
+ "lora_ga_config": null,
23
+ "megatron_config": null,
24
+ "megatron_core": "megatron.core",
25
+ "modules_to_save": [],
26
+ "peft_type": "LORA",
27
+ "peft_version": "0.19.1",
28
+ "qalora_group_size": 16,
29
+ "r": 32,
30
+ "rank_pattern": {},
31
+ "revision": null,
32
+ "target_modules": "^(model\\.language_model(?=\\.).*\\.(o_proj|out_proj|in_proj_qkv|gate_proj|k_proj|in_proj_z|down_proj|v_proj|q_proj|in_proj_b|up_proj|in_proj_a)|model\\.visual\\.merger(?=\\.).*\\.(linear_fc2|linear_fc1))$",
33
+ "target_parameters": null,
34
+ "task_type": "CAUSAL_LM",
35
+ "trainable_token_indices": null,
36
+ "use_bdlora": null,
37
+ "use_dora": false,
38
+ "use_qalora": false,
39
+ "use_rslora": false
40
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:880543b4bbc572e58980d58690b23ea9262d7ecbdd980bdf5a9139dfe022c881
3
+ size 174336432
additional_config.json ADDED
@@ -0,0 +1 @@
1
+ {"lora_dtype": null, "lorap_lr_ratio": null, "lorap_emb_lr": 1e-06}
infer.py ADDED
@@ -0,0 +1,413 @@
1
+ """Run inference with the judge LoRA adapter.
2
+
3
+ The script can either load files from a local directory or pull them
4
+ directly from the Hugging Face Hub. By default it points at the
5
+ companion repository ``NU-World-Model-Embodied-AI/phyjudge-9B``:
6
+
7
+ # From the Hub (no clone needed):
8
+ python infer.py --video demo.mp4 --caption "A ball rolls down a ramp." --metric SA
9
+ python infer.py --video demo.mp4 --caption "A ball rolls down a ramp." --law gravity
10
+
11
+ # From a local clone of the model repo:
12
+ python infer.py --adapter-dir /path/to/local/clone --video demo.mp4 \
13
+ --caption "A ball rolls down a ramp." --law gravity
14
+
15
+ It loads:
16
+ - adapter_config.json to find the base model
17
+ - adapter_model.safetensors through PEFT
18
+ - subq+human.yaml to render the scoring prompt
19
+ """
20
+
21
+ from __future__ import annotations
22
+
23
+ import argparse
24
+ import json
25
+ import re
26
+ from pathlib import Path
27
+ from typing import Any
28
+
29
+ import torch
30
+ import yaml
31
+ from peft import PeftModel
32
+ from transformers import AutoProcessor
33
+
34
+
35
+ GENERAL_SUB_QUESTIONS: dict[str, list[str]] = {
36
+ "SA": [
37
+ "Are the main objects in the caption present in the video?",
38
+ "Are the key actions or interactions from the caption visible?",
39
+ "Are important scene attributes and relationships preserved?",
40
+ "Does the video avoid major contradictions to the caption?",
41
+ ],
42
+ "PTV": [
43
+ "Do causes appear before their effects?",
44
+ "Do physical events unfold in a plausible temporal order?",
45
+ "Are motion transitions continuous rather than abrupt jumps or loops?",
46
+ "Does the sequence avoid impossible reversals or repeated resets?",
47
+ ],
48
+ "persistence": [
49
+ "Do objects maintain consistent existence throughout the video?",
50
+ "Do objects keep a stable shape, size, color, and texture?",
51
+ "Do objects avoid disappearing, appearing, or transforming unexpectedly?",
52
+ "Do objects preserve identity through motion and brief occlusion?",
53
+ ],
54
+ }
55
+
56
+
57
+ PHYSICAL_CRITERIA: dict[str, str] = {
58
+ "gravity": "Do unsupported objects fall downward? Do thrown objects follow a curved trajectory? Does poured liquid fall with gravity?",
59
+ "inertia": "Do stationary objects remain still unless acted upon? Do moving objects maintain their motion unless stopped by friction, collision, or an obstacle?",
60
+ "momentum": "After collision, push, or pull, is the direction of motion reasonable? Ignore speed magnitude.",
61
+ "impenetrability": "Do objects maintain impenetrability -- no passing through each other?",
62
+ "collision": "After impact, is there reasonable bounce/shatter/deformation? Does response match impact force?",
63
+ "material": "Does each material respond according to its properties? (glass shatters, rubber bounces, metal is rigid, cloth deforms softly, etc.)",
64
+ "buoyancy": "Do dense objects sink? Do wood/plastic float?",
65
+ "displacement": "When you add more liquid or put an object into it, does the liquid level rise in a realistic way? Does it overflow when full?",
66
+ "flow_dynamics": "Does the liquid's overall motion behave realistically over time -- flowing along surfaces, spreading, draining naturally?",
67
+ "boundary_interaction": "When the liquid hits a boundary such as a rock face, container wall, or floor, does it respond realistically? Do local splash, rebound, or split patterns on impact look physically plausible?",
68
+ "fluid_continuity": "Does the liquid avoid disappearing or appearing out of nowhere? Small splashes that briefly break apart are okay.",
69
+ "reflection": "Does the reflection roughly match objects and colors in the scene, and avoid completely unrelated content?",
70
+ "shadow": "Are shadow directions consistent with light source? Do shadows move with objects?",
71
+ }
72
+
73
+
74
+ PHYSICAL_SUB_QUESTIONS: dict[str, list[str]] = {
75
+ "gravity": [
76
+ "Do unsupported objects or liquids move downward over time?",
77
+ "Do thrown or falling objects follow a plausible gravity-driven path?",
78
+ "Does the video avoid objects floating or rising without support?",
79
+ ],
80
+ "inertia": [
81
+ "Do stationary objects remain still unless a visible force acts on them?",
82
+ "Do moving objects continue plausibly until friction, collision, or an obstacle changes their motion?",
83
+ "Does the video avoid unexplained starts, stops, or direction changes?",
84
+ ],
85
+ "momentum": [
86
+ "After contact, push, pull, or collision, are motion directions plausible?",
87
+ "Does the reacting object move in a direction consistent with the interaction?",
88
+ "Does the video avoid impossible reversals or unrelated motion changes?",
89
+ ],
90
+ "impenetrability": [
91
+ "Do solid objects avoid passing through one another?",
92
+ "Do contacts and overlaps remain physically plausible?",
93
+ "Does the video avoid obvious clipping or penetration artifacts?",
94
+ ],
95
+ "collision": [
96
+ "Does impact cause a plausible bounce, break, deformation, or transfer of motion?",
97
+ "Is the response direction consistent with the collision?",
98
+ "Does the response avoid being much too weak, too strong, or unrelated to the impact?",
99
+ ],
100
+ "material": [
101
+ "Do objects respond consistently with their apparent material?",
102
+ "Are rigid, soft, brittle, elastic, or fluid-like objects animated appropriately?",
103
+ "Does the video avoid material behavior that contradicts the scene?",
104
+ ],
105
+ "buoyancy": [
106
+ "Do objects sink or float in a way consistent with apparent density?",
107
+ "Does the floating or sinking behavior stay stable over time?",
108
+ "Does the video avoid unsupported hovering or impossible underwater motion?",
109
+ ],
110
+ "displacement": [
111
+ "Does liquid level rise when volume is added or an object enters it?",
112
+ "Does overflow happen only when the container is plausibly full?",
113
+ "Does the liquid volume remain visually plausible?",
114
+ ],
115
+ "flow_dynamics": [
116
+ "Does liquid flow along surfaces, spread, or drain naturally?",
117
+ "Does the flow direction follow gravity and boundaries?",
118
+ "Does the video avoid abrupt stops, reversals, or unsupported uphill flow?",
119
+ ],
120
+ "boundary_interaction": [
121
+ "Does liquid react plausibly when hitting a wall, floor, container, or obstacle?",
122
+ "Are splash, rebound, or split patterns locally plausible?",
123
+ "Does the liquid remain consistent after interacting with boundaries?",
124
+ ],
125
+ "fluid_continuity": [
126
+ "Does liquid avoid disappearing or appearing without cause?",
127
+ "Does the amount of liquid remain broadly consistent?",
128
+ "Are splashes and separations temporary and physically plausible?",
129
+ ],
130
+ "reflection": [
131
+ "Does the reflection match nearby objects, colors, and motion?",
132
+ "Does the reflected content stay spatially consistent with the scene?",
133
+ "Does the video avoid unrelated or impossible reflection content?",
134
+ ],
135
+ "shadow": [
136
+ "Are shadows consistent with the apparent light source direction?",
137
+ "Do shadows move with the objects that cast them?",
138
+ "Does the video avoid missing, detached, or contradictory shadows?",
139
+ ],
140
+ }
141
+
142
+
143
+ def load_json(path: Path) -> dict[str, Any]:
144
+ with path.open() as f:
145
+ return json.load(f)
146
+
147
+
148
+ def load_yaml(path: Path) -> dict[str, Any]:
149
+ with path.open() as f:
150
+ return yaml.safe_load(f)
151
+
152
+
153
+ def questions_block(questions: list[str]) -> str:
154
+ return "\n".join(f"{idx}. {question}" for idx, question in enumerate(questions, 1))
155
+
156
+
157
+ def build_prompt(
158
+ cfg: dict[str, Any],
159
+ caption: str,
160
+ *,
161
+ metric: str | None = None,
162
+ law: str | None = None,
163
+ criteria: str | None = None,
164
+ ) -> tuple[str, str, str]:
165
+ if metric:
166
+ if metric not in GENERAL_SUB_QUESTIONS:
167
+ raise ValueError(f"unknown metric: {metric}")
168
+ prompt = cfg["eval_prompts"][metric].format(
169
+ prompt=caption,
170
+ questions_block=questions_block(GENERAL_SUB_QUESTIONS[metric]),
171
+ )
172
+ return cfg["system_prompt"], prompt, metric
173
+
174
+ if not law:
175
+ raise ValueError("either --metric or --law is required")
176
+ if law not in PHYSICAL_CRITERIA:
177
+ raise ValueError(f"unknown law: {law}")
178
+ prompt = cfg["physical_template"].format(
179
+ prompt=caption,
180
+ law=law,
181
+ criteria=criteria or PHYSICAL_CRITERIA[law],
182
+ questions_block=questions_block(PHYSICAL_SUB_QUESTIONS[law]),
183
+ )
184
+ return cfg["system_prompt"], prompt, law
185
+
186
+
187
+ def load_base_model(base_id: str, dtype: torch.dtype, device_map: str):
188
+ errors: list[str] = []
189
+ for class_name in (
190
+ "AutoModelForImageTextToText",
191
+ "AutoModelForVision2Seq",
192
+ "AutoModelForCausalLM",
193
+ ):
194
+ try:
195
+ module = __import__("transformers", fromlist=[class_name])
196
+ model_cls = getattr(module, class_name)
197
+ return model_cls.from_pretrained(
198
+ base_id,
199
+ torch_dtype=dtype,
200
+ device_map=device_map,
201
+ trust_remote_code=True,
202
+ )
203
+ except Exception as exc: # pragma: no cover - depends on local transformers version
204
+ errors.append(f"{class_name}: {exc}")
205
+ raise RuntimeError("failed to load base model:\n" + "\n".join(errors))
206
+
207
+
208
+ def resolve_adapter_dir(source: str) -> Path:
209
+ """Return a local directory holding the adapter files.
210
+
211
+ If ``source`` is a directory containing ``adapter_config.json`` it is used
212
+ as-is. Otherwise ``source`` is interpreted as a HF Hub repo id and the
213
+ snapshot is downloaded into the local cache.
214
+ """
215
+ candidate = Path(source)
216
+ if candidate.is_dir() and (candidate / "adapter_config.json").exists():
217
+ return candidate
218
+ try:
219
+ from huggingface_hub import snapshot_download
220
+ except ImportError as exc:
221
+ raise ImportError(
222
+ "huggingface_hub is required to fetch the adapter from the Hub. "
223
+ "Install it with: pip install huggingface_hub"
224
+ ) from exc
225
+ return Path(snapshot_download(repo_id=source))
226
+
227
+
228
+ def load_model(adapter_source: str, dtype: torch.dtype, device_map: str) -> tuple[Any, Any, Path]:
229
+ adapter_dir = resolve_adapter_dir(adapter_source)
230
+ adapter_cfg = load_json(adapter_dir / "adapter_config.json")
231
+ base_id = adapter_cfg["base_model_name_or_path"]
232
+ processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)
233
+ base = load_base_model(base_id, dtype=dtype, device_map=device_map)
234
+ model = PeftModel.from_pretrained(base, adapter_dir)
235
+ model.eval()
236
+ return processor, model, adapter_dir
237
+
238
+
239
+ def build_messages(system_prompt: str, user_prompt: str, video_path: Path) -> list[dict[str, Any]]:
240
+ return [
241
+ {"role": "system", "content": system_prompt},
242
+ {
243
+ "role": "user",
244
+ "content": [
245
+ {"type": "video", "video": str(video_path)},
246
+ {"type": "text", "text": user_prompt},
247
+ ],
248
+ },
249
+ ]
250
+
251
+
252
+ def prepare_inputs(
253
+ processor: Any,
254
+ messages: list[dict[str, Any]],
255
+ device: torch.device,
256
+ *,
257
+ fps: float,
258
+ max_pixels: int,
259
+ ) -> dict[str, Any]:
260
+ text = processor.apply_chat_template(
261
+ messages,
262
+ tokenize=False,
263
+ add_generation_prompt=True,
264
+ )
265
+
266
+ try:
267
+ from qwen_vl_utils import process_vision_info
268
+ except ImportError as exc:
269
+ raise ImportError(
270
+ "qwen-vl-utils is required for local video inference. "
271
+ "Install it with: pip install qwen-vl-utils[decord]"
272
+ ) from exc
273
+
274
+ for msg in messages:
275
+ content = msg.get("content")
276
+ if isinstance(content, list):
277
+ for item in content:
278
+ if item.get("type") == "video":
279
+ item.setdefault("fps", fps)
280
+ item.setdefault("max_pixels", max_pixels)
281
+
282
+ try:
283
+ image_inputs, video_inputs, video_kwargs = process_vision_info(
284
+ messages,
285
+ return_video_kwargs=True,
286
+ )
287
+ except TypeError:
288
+ image_inputs, video_inputs = process_vision_info(messages)
289
+ video_kwargs = {}
290
+
291
+ inputs = processor(
292
+ text=[text],
293
+ images=image_inputs,
294
+ videos=video_inputs,
295
+ padding=True,
296
+ return_tensors="pt",
297
+ **video_kwargs,
298
+ )
299
+ return inputs.to(device)
300
+
301
+
302
+ def decode_generated(processor: Any, inputs: dict[str, Any], generated_ids: torch.Tensor) -> str:
303
+ input_len = inputs["input_ids"].shape[1]
304
+ generated_ids = generated_ids[:, input_len:]
305
+ return processor.batch_decode(
306
+ generated_ids,
307
+ skip_special_tokens=True,
308
+ clean_up_tokenization_spaces=False,
309
+ )[0].strip()
310
+
311
+
312
+ def parse_score(text: str, key: str) -> int | None:
313
+ match = re.search(r"\{.*?\}", text, flags=re.S)
314
+ if match:
315
+ try:
316
+ obj = json.loads(match.group(0))
317
+ value = obj.get(key)
318
+ if isinstance(value, int) and 1 <= value <= 5:
319
+ return value
320
+ except json.JSONDecodeError:
321
+ pass
322
+ match = re.search(rf'"?{re.escape(key)}"?\s*:\s*([1-5])', text)
323
+ if match:
324
+ return int(match.group(1))
325
+ return None
326
+
327
+
328
+ def dtype_from_name(name: str) -> torch.dtype:
329
+ if name == "bfloat16":
330
+ return torch.bfloat16
331
+ if name == "float16":
332
+ return torch.float16
333
+ if name == "float32":
334
+ return torch.float32
335
+ raise ValueError(f"unsupported dtype: {name}")
336
+
337
+
338
+ def main() -> None:
339
+ parser = argparse.ArgumentParser(description="Infer with the judge adapter.")
340
+ parser.add_argument(
341
+ "--adapter-dir",
342
+ default="NU-World-Model-Embodied-AI/phyjudge-9B",
343
+ help=(
344
+ "Local directory with adapter_config.json + adapter_model.safetensors "
345
+ "+ subq+human.yaml, or a HF Hub repo id "
346
+ "(default: NU-World-Model-Embodied-AI/phyjudge-9B)."
347
+ ),
348
+ )
349
+ parser.add_argument("--video", required=True, type=Path)
350
+ parser.add_argument("--caption", required=True)
351
+ group = parser.add_mutually_exclusive_group(required=True)
352
+ group.add_argument("--metric", choices=["SA", "PTV", "persistence"])
353
+ group.add_argument("--law", choices=sorted(PHYSICAL_CRITERIA))
354
+ parser.add_argument("--criteria", help="Override physical-law criterion text.")
355
+ parser.add_argument("--max-new-tokens", type=int, default=64)
356
+ parser.add_argument("--temperature", type=float, default=0.0)
357
+ parser.add_argument("--fps", type=float, default=2.0)
358
+ parser.add_argument("--max-pixels", type=int, default=360 * 640)
359
+ parser.add_argument("--dtype", choices=["bfloat16", "float16", "float32"], default="bfloat16")
360
+ parser.add_argument("--device-map", default="auto")
361
+ parser.add_argument("--print-prompt", action="store_true")
362
+ args = parser.parse_args()
363
+
364
+ if not args.video.is_file():
365
+ raise FileNotFoundError(args.video)
366
+
367
+ dtype = dtype_from_name(args.dtype)
368
+ processor, model, adapter_dir = load_model(
369
+ args.adapter_dir, dtype=dtype, device_map=args.device_map
370
+ )
371
+
372
+ prompt_cfg = load_yaml(adapter_dir / "subq+human.yaml")
373
+ system_prompt, user_prompt, score_key = build_prompt(
374
+ prompt_cfg,
375
+ args.caption,
376
+ metric=args.metric,
377
+ law=args.law,
378
+ criteria=args.criteria,
379
+ )
380
+
381
+ if args.print_prompt:
382
+ print("SYSTEM:")
383
+ print(system_prompt)
384
+ print("\nUSER:")
385
+ print(user_prompt)
386
+ print()
387
+ device = next(model.parameters()).device
388
+ messages = build_messages(system_prompt, user_prompt, args.video)
389
+ inputs = prepare_inputs(
390
+ processor,
391
+ messages,
392
+ device,
393
+ fps=args.fps,
394
+ max_pixels=args.max_pixels,
395
+ )
396
+
397
+ generation_kwargs: dict[str, Any] = {
398
+ "max_new_tokens": args.max_new_tokens,
399
+ "do_sample": args.temperature > 0,
400
+ "temperature": args.temperature if args.temperature > 0 else None,
401
+ }
402
+ generation_kwargs = {k: v for k, v in generation_kwargs.items() if v is not None}
403
+
404
+ with torch.inference_mode():
405
+ generated_ids = model.generate(**inputs, **generation_kwargs)
406
+
407
+ raw = decode_generated(processor, inputs, generated_ids)
408
+ score = parse_score(raw, score_key)
409
+ print(json.dumps({"key": score_key, "score": score, "raw": raw}, ensure_ascii=False, indent=2))
410
+
411
+
412
+ if __name__ == "__main__":
413
+ main()
subq+human.yaml ADDED
@@ -0,0 +1,106 @@
1
+ scheme: subq_hint
2
+ description: |-
3
+ JSON-only per-task prompts with observable sub-questions/checklists. The
4
+ subq+human setting uses sub-question prompts as input and human scores as
5
+ training targets.
6
+ sub_questions:
7
+ source: static
8
+ answer_format: hint
9
+ system_prompt: You are a strict video evaluation model.
10
+ general_keys:
11
+ - SA
12
+ - PTV
13
+ - persistence
14
+ eval_prompts:
15
+ SA: |-
16
+ Evaluate Prompt Alignment (SA).
17
+
18
+ Caption:
19
+ "{prompt}"
20
+
21
+ The video was generated using a text+image-to-video (ti2v) model, conditioned on the first frame and the text prompt above.
22
+
23
+ Sub-questions to consider in your mind before scoring:
24
+ {questions_block}
25
+
26
+ Score 1-5:
27
+ 5=fully aligned
28
+ 4=mostly aligned with minor deviations
29
+ 3=partially aligned with notable gaps
30
+ 2=mostly misaligned
31
+ 1=not aligned
32
+
33
+ Then output ONLY a JSON object with exactly one key: SA.
34
+
35
+ Example:
36
+ {{"SA": 3}}
37
+ PTV: |-
38
+ Evaluate Temporal Coherence (PTV).
39
+
40
+ Caption:
41
+ "{prompt}"
42
+
43
+ The video was generated using a text+image-to-video (ti2v) model, conditioned on the first frame and the text prompt above.
44
+
45
+ Sub-questions to consider in your mind before scoring:
46
+ {questions_block}
47
+
48
+ Score 1-5:
49
+ 5=fully plausible event order
50
+ 4=mostly plausible with minor timing issues
51
+ 3=partially plausible
52
+ 2=mostly implausible
53
+ 1=completely implausible order
54
+
55
+ Then output ONLY a JSON object with exactly one key: PTV.
56
+
57
+ Example:
58
+ {{"PTV": 4}}
59
+ persistence: |-
60
+ Evaluate Object Persistence.
61
+
62
+ Caption, for context only:
63
+ "{prompt}"
64
+
65
+ The video was generated using a text+image-to-video (ti2v) model, conditioned on the first frame and the text prompt above.
66
+
67
+ Sub-questions to consider in your mind before scoring:
68
+ {questions_block}
69
+
70
+ Score 1-5:
71
+ 5=fully consistent
72
+ 4=mostly consistent with minor flicker
73
+ 3=noticeable issues
74
+ 2=major inconsistencies
75
+ 1=severe disappearance or identity changes
76
+
77
+ Then output ONLY a JSON object with exactly one key: persistence.
78
+
79
+ Example:
80
+ {{"persistence": 4}}
81
+ physical_sub_questions: true
82
+ physical_template: |-
83
+ Evaluate physical realism for one physical law: {law}.
84
+
85
+ Criterion:
86
+ {criteria}
87
+
88
+ Caption, for context only:
89
+ "{prompt}"
90
+
91
+ Sub-questions to consider in your mind before scoring:
92
+ {questions_block}
93
+
94
+ Judge the video itself. Do not penalize prompt mismatch unless it affects whether this physical law can be evaluated.
95
+
96
+ Score 1-5:
97
+ 5=clearly correct
98
+ 4=mostly correct with minor issues
99
+ 3=partially correct or ambiguous
100
+ 2=mostly incorrect
101
+ 1=severely incorrect
102
+
103
+ Then output ONLY a JSON object with exactly one key: {law}.
104
+
105
+ Example:
106
+ {{"{law}": 3}}
training_args.json ADDED
@@ -0,0 +1,50 @@
1
+ {
2
+ "_comment": "Sanitized excerpt of the training configuration. Local paths, tracking IDs, and base-model identity removed (see adapter_config.json for the base model required by PEFT).",
3
+ "task_type": "causal_lm",
4
+ "torch_dtype": "bfloat16",
5
+ "max_length": 8192,
6
+ "max_new_tokens": 64,
7
+ "tuner": {
8
+ "type": "lora",
9
+ "lora_rank": 32,
10
+ "lora_alpha": 64,
11
+ "lora_dropout": 0.05,
12
+ "lora_bias": "none",
13
+ "target_modules": "all-linear (language model only; vision merger limited to linear_fc1/linear_fc2)",
14
+ "use_dora": false,
15
+ "use_rslora": false,
16
+ "freeze_vit": true,
17
+ "freeze_aligner": false
18
+ },
19
+ "optimizer": {
20
+ "name": "adamw_torch_fused",
21
+ "learning_rate": 1e-4,
22
+ "weight_decay": 0.1,
23
+ "adam_beta1": 0.9,
24
+ "adam_beta2": 0.95,
25
+ "adam_epsilon": 1e-8,
26
+ "max_grad_norm": 1.0,
27
+ "lr_scheduler_type": "cosine",
28
+ "warmup_ratio": 0.05,
29
+ "aligner_lr": 2e-6
30
+ },
31
+ "training": {
32
+ "num_train_epochs": 1.0,
33
+ "per_device_train_batch_size": 1,
34
+ "gradient_accumulation_steps": 8,
35
+ "world_size": 4,
36
+ "global_batch_size": 32,
37
+ "bf16": true,
38
+ "gradient_checkpointing": true,
39
+ "seed": 42,
40
+ "data_seed": 42,
41
+ "deepspeed_zero_stage": 2,
42
+ "total_steps": 294,
43
+ "best_eval_loss": 0.1063,
44
+ "best_step": 294
45
+ },
46
+ "framework": {
47
+ "ms_swift_version": "4.1.2",
48
+ "peft_version": "0.19.1"
49
+ }
50
+ }