anonymouscla committed
Commit 6d0e4db · verified · 1 Parent(s): 4b2c01e

docs: complete inference snippet + embed rubric prompt templates

Files changed (1)
  1. README.md +217 -37
README.md CHANGED
@@ -9,26 +9,225 @@ tags:
  - anonymous-release
  ---

- # Anonymous Release — Judge LoRA Adapter

- A LoRA adapter trained as a judge model that scores generated videos against
- physical-law sub-rubrics derived from text prompts. Released anonymously
- alongside the companion dataset
  [`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground).

- The base model identifier required to load this adapter is recorded in
- `adapter_config.json` (`base_model_name_or_path`).

  ## Files
 
  ```
- adapter_config.json # PEFT/LoRA config
- adapter_model.safetensors # LoRA weights (~167 MB)
- additional_config.json # ms-swift extras (lora_dtype / lr ratios)
- training_args.json # sanitized training hyperparameters
  ```

- ## Training
  | Item | Value |
  | --- | --- |
@@ -41,32 +240,13 @@ training_args.json # sanitized training hyperparameters
  | Best eval loss | 0.1063 (step 294) |
  | Framework | ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2 |

- The training mixture combines automatically derived sub-question judgements
- with human-rated samples (the `subq+human` split). See the companion
- anonymous dataset for prompts, physical-law tags, and example videos.
-
- ## Usage
-
- ```python
- import json
- from peft import PeftModel
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- adapter_dir = "." # this directory
- base_id = json.load(open(f"{adapter_dir}/adapter_config.json"))["base_model_name_or_path"]
-
- tokenizer = AutoTokenizer.from_pretrained(base_id)
- base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="bfloat16", device_map="auto")
- model = PeftModel.from_pretrained(base, adapter_dir)
- model.eval()
- ```
-
- The adapter expects the base model's default chat template, with a prompt
- that asks the judge to answer one or more sub-rubric questions about a
- candidate video frame/caption. Greedy decoding (`temperature = 0`) with
- `max_new_tokens = 64` matches the training-time generation config.

  ## License

- The base model is released by its respective authors; this LoRA adapter is
- shared for anonymous review purposes. No identifying metadata is included.
 
  - anonymous-release
  ---

+ # physground-judger9B — Anonymous Judge LoRA Adapter

+ A LoRA adapter trained as a judge model that scores generated videos against
+ prompt-alignment, temporal-coherence, and object-persistence axes plus 13
+ physical-law sub-rubrics. Released anonymously alongside the companion dataset
  [`anonymouscla/physground`](https://huggingface.co/datasets/anonymouscla/physground).

+ The base model identifier required to attach this adapter is recorded in
+ `adapter_config.json` (`base_model_name_or_path`); the inference script
+ reads it automatically.

  ## Files

+ | File | Purpose |
+ | --- | --- |
+ | `adapter_config.json` | PEFT/LoRA config (records base model id) |
+ | `adapter_model.safetensors` | LoRA weights (~167 MB) |
+ | `additional_config.json` | ms-swift extras (lora_dtype / lr ratios) |
+ | `training_args.json` | sanitized training hyperparameters |
+ | `subq+human.yaml` | prompt template used at training and inference time |
+ | `infer.py` | standalone end-to-end inference script |
+
+ ## Setup
+
+ ```bash
+ pip install "transformers>=4.49" peft accelerate pyyaml \
+     "qwen-vl-utils[decord]" huggingface_hub
+ ```
+
+ Loading the base model in bf16 needs roughly 24 GB of GPU memory.
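The 24 GB figure is consistent with a back-of-envelope estimate (a sketch; actual usage also depends on context length, video resolution, and CUDA overhead):

```python
# Rough bf16 memory estimate for a 9B-parameter base model.
params = 9e9          # parameter count of the base model
bytes_per_param = 2   # bf16 stores each parameter in 2 bytes

weights_gib = params * bytes_per_param / 1024**3
adapter_gib = 167 / 1024  # the ~167 MB LoRA adapter is negligible

print(f"weights ~ {weights_gib:.1f} GiB, adapter ~ {adapter_gib:.2f} GiB")
```

Weights alone come to about 17 GiB; the KV cache, vision features, and framework overhead account for the rest.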
+
+ ## Quickstart — Hugging Face Hub
+
+ `infer.py` accepts either a local folder or an HF Hub repo id via
+ `--adapter-dir`; the default value already points at this repo, so the
+ following commands work without cloning anything.
+
+ ```bash
+ # General axes (1–5 each): SA / PTV / persistence
+ python infer.py \
+     --video /path/to/video.mp4 \
+     --caption "A ball rolls down a ramp and knocks over a block." \
+     --metric SA
+
+ # Physical-law axes (1–5 each): one of the 13 laws below
+ python infer.py \
+     --video /path/to/video.mp4 \
+     --caption "A ball rolls down a ramp and knocks over a block." \
+     --law gravity
  ```
+
+ `infer.py` will:
+
+ 1. Resolve `--adapter-dir` to a local directory (via
+    `huggingface_hub.snapshot_download` if it is a Hub id).
+ 2. Read `adapter_config.json` to find the base model and load it via
+    `transformers` (`Qwen/Qwen3.5-9B`).
+ 3. Attach the LoRA adapter via PEFT.
+ 4. Render the scoring prompt from `subq+human.yaml`, plus the relevant
+    sub-questions / per-law criterion (constants embedded in `infer.py`).
+ 5. Run greedy decoding with `--max-new-tokens 64` (matches training).
+ 6. Parse the JSON object and print the integer score.
+
+ Output is a single JSON line:
+
+ ```json
+ {"key": "gravity", "score": 4, "raw": "{\"gravity\": 4}"}
  ```
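The final parsing step can be sketched like this (a hypothetical stand-in for `parse_score` in `infer.py`; the shipped implementation may differ):

```python
import json
import re

def parse_score(raw: str, key: str) -> int:
    """Pull the integer score for `key` out of the judge's raw output.

    The judge is instructed to emit only a JSON object such as
    {"gravity": 4}, but we defensively grab the first {...} span in
    case the model wraps it in extra text.
    """
    match = re.search(r"\{.*?\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object found in: {raw!r}")
    score = int(json.loads(match.group(0))[key])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of 1-5 range: {score}")
    return score

print(parse_score('{"gravity": 4}', "gravity"))  # → 4
```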

+ `--metric` choices: `SA`, `PTV`, `persistence`.
+ `--law` choices: `gravity`, `inertia`, `momentum`, `impenetrability`,
+ `collision`, `material`, `buoyancy`, `displacement`, `flow_dynamics`,
+ `boundary_interaction`, `fluid_continuity`, `reflection`, `shadow`.
+
+ Add `--print-prompt` to inspect the exact rendered system + user prompt
+ before generation.
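To score one video on every axis, the CLI calls can be generated in a loop (a sketch that only builds the command lines; it assumes `infer.py` sits in the working directory):

```python
import shlex

METRICS = ["SA", "PTV", "persistence"]
LAWS = [
    "gravity", "inertia", "momentum", "impenetrability", "collision",
    "material", "buoyancy", "displacement", "flow_dynamics",
    "boundary_interaction", "fluid_continuity", "reflection", "shadow",
]

video = "/path/to/video.mp4"
caption = "A ball rolls down a ramp and knocks over a block."

# One command per axis: 3 general metrics + 13 physical laws = 16 calls.
commands = [
    f"python infer.py --video {video} --caption {shlex.quote(caption)} "
    + (f"--metric {key}" if key in METRICS else f"--law {key}")
    for key in METRICS + LAWS
]
print(len(commands))  # → 16
```

Since each call prints one JSON line, running the commands and collecting stdout yields all 16 scores for the video.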
+
+ ## Programmatic use
+
+ ```python
+ from pathlib import Path
+ import torch
+
+ from infer import (
+     build_messages,
+     build_prompt,
+     decode_generated,
+     load_model,
+     load_yaml,
+     parse_score,
+     prepare_inputs,
+ )
+
+ processor, model, adapter_dir = load_model(
+     "anonymouscla/physground-judger9B",
+     dtype=torch.bfloat16,
+     device_map="auto",
+ )
+ cfg = load_yaml(adapter_dir / "subq+human.yaml")
+
+ system, user, key = build_prompt(
+     cfg,
+     caption="A ball rolls down a ramp and knocks over a block.",
+     law="gravity",
+ )
+ messages = build_messages(system, user, Path("video.mp4"))
+ inputs = prepare_inputs(
+     processor,
+     messages,
+     next(model.parameters()).device,
+     fps=2.0,
+     max_pixels=360 * 640,
+ )
+
+ with torch.inference_mode():
+     out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+
+ raw = decode_generated(processor, inputs, out)
+ print({"key": key, "score": parse_score(raw, key), "raw": raw})
+ ```
+
+ ## Prompt templates
+
+ System prompt (all axes): `You are a strict video evaluation model.`
+
+ Every prompt asks the judge to reason over observable sub-questions and
+ then output **only** a JSON object with a single 1–5 score.
+
+ ### General axes — `subq+human.yaml`
+
+ **SA — Prompt alignment**
+ ```
+ Evaluate Prompt Alignment (SA).
+
+ Caption:
+ "{prompt}"
+
+ The video was generated using a text+image-to-video (ti2v) model,
+ conditioned on the first frame and the text prompt above.
+
+ Sub-questions to consider in your mind before scoring:
+ 1. Are the main objects in the caption present in the video?
+ 2. Are the key actions or interactions from the caption visible?
+ 3. Are important scene attributes and relationships preserved?
+ 4. Does the video avoid major contradictions to the caption?
+
+ Score 1-5: 5=fully aligned, 4=mostly aligned with minor deviations,
+ 3=partially aligned with notable gaps, 2=mostly misaligned,
+ 1=not aligned
+
+ Then output ONLY a JSON object with exactly one key: SA.
+ Example: {"SA": 3}
+ ```
+
+ **PTV — Temporal coherence** uses the same shape with sub-questions:
+
+ 1. Do causes appear before their effects?
+ 2. Do physical events unfold in a plausible temporal order?
+ 3. Are motion transitions continuous rather than abrupt jumps or loops?
+ 4. Does the sequence avoid impossible reversals or repeated resets?
+
+ **persistence — Object persistence** uses:
+
+ 1. Do objects maintain consistent existence throughout the video?
+ 2. Do objects keep a stable shape, size, color, and texture?
+ 3. Do objects avoid disappearing, appearing, or transforming unexpectedly?
+ 4. Do objects preserve identity through motion and brief occlusion?
+
+ (See `subq+human.yaml` for the verbatim PTV / persistence rubric anchors.)
+
+ ### Physical-law axes — `physical_template`
+
+ ```
+ Evaluate physical realism for one physical law: {law}.
+
+ Criterion:
+ {criteria}
+
+ Caption, for context only:
+ "{prompt}"
+
+ Sub-questions to consider in your mind before scoring:
+ {questions_block}
+
+ Judge the video itself. Do not penalize prompt mismatch unless it affects
+ whether this physical law can be evaluated.
+
+ Score 1-5: 5=clearly correct, 4=mostly correct with minor issues,
+ 3=partially correct or ambiguous, 2=mostly incorrect,
+ 1=severely incorrect
+
+ Then output ONLY a JSON object with exactly one key: {law}.
+ Example: {"{law}": 3}
+ ```
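Note that the template's `Example:` line contains literal JSON braces, so a naive `str.format` call would choke on it. A sketch of placeholder substitution that sidesteps this (the actual rendering in `infer.py` may differ):

```python
# Abbreviated copy of the physical-law template above.
PHYSICAL_TEMPLATE = """Evaluate physical realism for one physical law: {law}.

Criterion:
{criteria}

Sub-questions to consider in your mind before scoring:
{questions_block}

Then output ONLY a JSON object with exactly one key: {law}.
Example: {"{law}": 3}"""

def render(template: str, **fields: str) -> str:
    # str.format would trip over the literal braces in the Example line,
    # so substitute each {placeholder} explicitly instead.
    for name, value in fields.items():
        template = template.replace("{" + name + "}", value)
    return template

prompt = render(
    PHYSICAL_TEMPLATE,
    law="gravity",
    criteria="Do unsupported objects fall downward?",
    questions_block="1. Do unsupported objects or liquids move downward over time?",
)
print(prompt.splitlines()[-1])  # → Example: {"gravity": 3}
```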
+
+ `{criteria}` and `{questions_block}` for each of the 13 laws are listed
+ below. They are also embedded in `infer.py` (`PHYSICAL_CRITERIA`,
+ `PHYSICAL_SUB_QUESTIONS`) so the script is a self-contained reference.
+
+ | Law | Criterion | Sub-questions |
+ | --- | --- | --- |
+ | `gravity` | Do unsupported objects fall downward? Do thrown objects follow a curved trajectory? Does poured liquid fall with gravity? | (1) Do unsupported objects or liquids move downward over time? (2) Do thrown or falling objects follow a plausible gravity-driven path? (3) Does the video avoid objects floating or rising without support? |
+ | `inertia` | Do stationary objects remain still unless acted upon? Do moving objects maintain their motion unless stopped by friction, collision, or an obstacle? | (1) Do stationary objects remain still unless a visible force acts on them? (2) Do moving objects continue plausibly until friction, collision, or an obstacle changes their motion? (3) Does the video avoid unexplained starts, stops, or direction changes? |
+ | `momentum` | After collision, push, or pull, is the direction of motion reasonable? Ignore speed magnitude. | (1) After contact, push, pull, or collision, are motion directions plausible? (2) Does the reacting object move in a direction consistent with the interaction? (3) Does the video avoid impossible reversals or unrelated motion changes? |
+ | `impenetrability` | Do objects maintain impenetrability — no passing through each other? | (1) Do solid objects avoid passing through one another? (2) Do contacts and overlaps remain physically plausible? (3) Does the video avoid obvious clipping or penetration artifacts? |
+ | `collision` | After impact, is there reasonable bounce/shatter/deformation? Does response match impact force? | (1) Does impact cause a plausible bounce, break, deformation, or transfer of motion? (2) Is the response direction consistent with the collision? (3) Does the response avoid being much too weak, too strong, or unrelated to the impact? |
+ | `material` | Does each material respond according to its properties? (glass shatters, rubber bounces, metal is rigid, cloth deforms softly, etc.) | (1) Do objects respond consistently with their apparent material? (2) Are rigid, soft, brittle, elastic, or fluid-like objects animated appropriately? (3) Does the video avoid material behavior that contradicts the scene? |
+ | `buoyancy` | Do dense objects sink? Do wood/plastic float? | (1) Do objects sink or float in a way consistent with apparent density? (2) Does the floating or sinking behavior stay stable over time? (3) Does the video avoid unsupported hovering or impossible underwater motion? |
+ | `displacement` | When you add more liquid or put an object into it, does the liquid level rise in a realistic way? Does it overflow when full? | (1) Does liquid level rise when volume is added or an object enters it? (2) Does overflow happen only when the container is plausibly full? (3) Does the liquid volume remain visually plausible? |
+ | `flow_dynamics` | Does the liquid's overall motion behave realistically over time — flowing along surfaces, spreading, draining naturally? | (1) Does liquid flow along surfaces, spread, or drain naturally? (2) Does the flow direction follow gravity and boundaries? (3) Does the video avoid abrupt stops, reversals, or unsupported uphill flow? |
+ | `boundary_interaction` | When the liquid hits a boundary such as a rock face, container wall, or floor, does it respond realistically? Do local splash, rebound, or split patterns on impact look physically plausible? | (1) Does liquid react plausibly when hitting a wall, floor, container, or obstacle? (2) Are splash, rebound, or split patterns locally plausible? (3) Does the liquid remain consistent after interacting with boundaries? |
+ | `fluid_continuity` | Does the liquid avoid disappearing or appearing out of nowhere? Small splashes that briefly break apart are okay. | (1) Does liquid avoid disappearing or appearing without cause? (2) Does the amount of liquid remain broadly consistent? (3) Are splashes and separations temporary and physically plausible? |
+ | `reflection` | Does the reflection roughly match objects and colors in the scene, and avoid completely unrelated content? | (1) Does the reflection match nearby objects, colors, and motion? (2) Does the reflected content stay spatially consistent with the scene? (3) Does the video avoid unrelated or impossible reflection content? |
+ | `shadow` | Are shadow directions consistent with light source? Do shadows move with objects? | (1) Are shadows consistent with the apparent light source direction? (2) Do shadows move with the objects that cast them? (3) Does the video avoid missing, detached, or contradictory shadows? |
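The Sub-questions column is what fills `{questions_block}`: a numbered list, one question per line. A sketch of the assembly, mirroring the assumed shape of `PHYSICAL_SUB_QUESTIONS` in `infer.py` (only one law shown here):

```python
# Excerpt of the per-law sub-questions; infer.py embeds all 13 laws.
PHYSICAL_SUB_QUESTIONS = {
    "gravity": [
        "Do unsupported objects or liquids move downward over time?",
        "Do thrown or falling objects follow a plausible gravity-driven path?",
        "Does the video avoid objects floating or rising without support?",
    ],
}

def questions_block(law: str) -> str:
    """Render the numbered sub-question list inserted into the prompt."""
    return "\n".join(
        f"{i}. {q}" for i, q in enumerate(PHYSICAL_SUB_QUESTIONS[law], start=1)
    )

print(questions_block("gravity"))
```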
+
+ Pass `--criteria "..."` to override a per-law criterion at inference time
+ without editing the YAML or script.
+
+ ## Training summary

  | Item | Value |
  | --- | --- |
  | Best eval loss | 0.1063 (step 294) |
  | Framework | ms-swift 4.1.2, PEFT 0.19.1, DeepSpeed ZeRO-2 |

+ The training mixture combines automatically derived sub-question
+ judgements with human-rated samples (the `subq+human` split). See the
+ companion anonymous dataset for prompts, physical-law tags, and example
+ videos.

  ## License

+ The base model is released by its respective authors; this LoRA adapter
+ is shared for anonymous review purposes. No identifying metadata is
+ included.